# Movie Recommendation System with Autoencoders and Feature Engineering

## Introduction

This notebook walks through the steps of building a movie recommendation system. We use a combination of traditional machine learning, feature engineering, and deep learning to handle missing values and predict similarities between movies. The final goal is to recommend movies that are similar based on embeddings learned from an autoencoder model. We will break down the steps from preprocessing to building the model in a way that is both informative and easy to follow.

## Steps

### 1. **Loading and Exploring the Dataset**
   - **Objective:** Read the movie dataset (`n_movies.csv`) and explore its structure.
   - **Why:** Understanding the data is critical to making informed preprocessing and feature engineering decisions.

### 2. **Handling Missing and Duplicate Values**
   - **Objective:** Clean the data by filling missing values and removing duplicates.
   - **Why:** Missing data and duplicates can cause inconsistencies in training machine learning models and impact prediction quality.

### 3. **Feature Engineering**
   - **Objective:** Create new features based on existing columns such as title, description, stars, and genre.
   - **Why:** By combining different text features, we allow the model to learn more meaningful patterns and improve its ability to recommend similar movies.

### 4. **Encoding Categorical Features**
   - **Objective:** Use `LabelEncoder` for categorical features like 'certificate' and multi-hot encoding for genres.
   - **Why:** Most scikit-learn estimators expect numeric input, so categorical features have to be encoded as numbers first.

### 5. **Text Feature Representation Using TF-IDF and HashingVectorizer**
   - **Objective:** Convert text data (titles, descriptions, and stars) into numerical vectors using `TfidfVectorizer` and `HashingVectorizer`.
   - **Why:** Text data must be converted into numerical features for machine learning models. TF-IDF is a popular method for capturing the importance of words in documents.

### 6. **Predicting Missing Certificates**
   - **Objective:** Train a RandomForestClassifier to predict missing 'certificate' values.
   - **Why:** Handling missing values in the certificate field helps us build a cleaner dataset for downstream tasks.

### 7. **Multi-Target Regression for Predicting Ratings and Votes**
   - **Objective:** Use Ridge regression to predict missing 'rating' and 'votes' values.
   - **Why:** Filling in missing ratings and votes allows us to create a complete dataset, which improves model training for recommendation tasks.

### 8. **Autoencoder for Learning Movie Embeddings**
   - **Objective:** Build an autoencoder to learn a low-dimensional embedding of the movie features.
   - **Why:** Autoencoders are powerful neural networks that can learn compressed representations (embeddings) of input data, capturing the most important features for similarity-based recommendations.

### 9. **Cosine Similarity for Movie Recommendations**
   - **Objective:** Compute the cosine similarity between the learned embeddings to find similar movies.
   - **Why:** Cosine similarity compares the direction of two vectors regardless of their magnitude, which makes it a common choice for comparing embeddings in recommendation systems.

### 10. **Retrieving Similar Movies**
   - **Objective:** Create a function to retrieve the top 10 most similar movies based on cosine similarity.
   - **Why:** This function will allow us to generate recommendations based on the learned embeddings.

## Conclusion
In this notebook, we combined traditional machine learning and deep learning techniques to build a movie recommendation system. Through careful feature engineering and model building, we were able to predict missing values and provide recommendations based on similarities in learned embeddings. This approach can be extended to other recommendation tasks with minimal modification.


In [1]:
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import classification_report
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from tensorflow.keras import Model, layers, models

### Let's read in the dataframe 

In [2]:
df = pd.read_csv('n_movies.csv')

### Using the `.info()` method enables us to see the dtype of each column as well as how many values are missing

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9957 entries, 0 to 9956
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        9957 non-null   object 
 1   year         9430 non-null   object 
 2   certificate  6504 non-null   object 
 3   duration     7921 non-null   object 
 4   genre        9884 non-null   object 
 5   rating       8784 non-null   float64
 6   description  9957 non-null   object 
 7   stars        9957 non-null   object 
 8   votes        8784 non-null   object 
dtypes: float64(1), object(8)
memory usage: 700.2+ KB


### Calling `.head()` on a DataFrame returns the first 5 rows by default, which gives us an idea of what our data looks like

In [4]:
df.head()

Unnamed: 0,title,year,certificate,duration,genre,rating,description,stars,votes
0,Cobra Kai,(2018– ),TV-14,30 min,"Action, Comedy, Drama",8.5,Decades after their 1984 All Valley Karate Tou...,"['Ralph Macchio, ', 'William Zabka, ', 'Courtn...",177031
1,The Crown,(2016– ),TV-MA,58 min,"Biography, Drama, History",8.7,Follows the political rivalries and romance of...,"['Claire Foy, ', 'Olivia Colman, ', 'Imelda St...",199885
2,Better Call Saul,(2015–2022),TV-MA,46 min,"Crime, Drama",8.9,The trials and tribulations of criminal lawyer...,"['Bob Odenkirk, ', 'Rhea Seehorn, ', 'Jonathan...",501384
3,Devil in Ohio,(2022),TV-MA,356 min,"Drama, Horror, Mystery",5.9,When a psychiatrist shelters a mysterious cult...,"['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo...",9773
4,Cyberpunk: Edgerunners,(2022– ),TV-MA,24 min,"Animation, Action, Adventure",8.6,A Street Kid trying to survive in a technology...,"['Zach Aguilar, ', 'Kenichiro Ohashi, ', 'Emi ...",15413


#### Things to note about the data above 
1. `stars` is stored as the string form of a list
2. `certificate` is missing many values (only 6,504 of 9,957 rows are non-null)
3. `rating` is mostly, but not fully, populated (8,784 of 9,957 rows are non-null)

#### Let's see if our titles/stars are unique 

##### In this case they are not - we have duplicate titles which will throw off our learning process

In [5]:
print(len(df.title.unique())) 

7912


In [6]:
df = df.drop_duplicates(subset=['title'])
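
Dropping on `title` alone also removes distinct works that happen to share a title (a remake, for instance). The sketch below uses a small made-up frame to compare title-only deduplication with deduplication on `title` and `year` together. Treating a matching title and year as a true duplicate is an assumption for illustration; the notebook itself does not check it.

```python
import pandas as pd

# Toy data (hypothetical): rows 1 and 2 are exact repeats, row 0 is an older
# work that shares the same title.
toy = pd.DataFrame({
    "title": ["Dune", "Dune", "Dune", "Cobra Kai"],
    "year": ["(1984)", "(2021)", "(2021)", "(2018– )"],
})

# Number of rows that repeat an earlier title
n_title_dupes = int(toy.duplicated(subset=["title"]).sum())

# Title-only deduplication (what the notebook does) vs. title + year
by_title = toy.drop_duplicates(subset=["title"])
by_title_year = toy.drop_duplicates(subset=["title", "year"])

print(n_title_dupes, len(by_title), len(by_title_year))
```

On this toy frame the title-only variant keeps two rows and the title-plus-year variant keeps three.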

##### The `stars` values are almost all unique because each one is the string form of a whole list of names; we will have to clean this up

In [7]:
print(len(df.stars.unique()))

7460


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7912 entries, 0 to 9912
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        7912 non-null   object 
 1   year         7410 non-null   object 
 2   certificate  4737 non-null   object 
 3   duration     6459 non-null   object 
 4   genre        7848 non-null   object 
 5   rating       6872 non-null   float64
 6   description  7912 non-null   object 
 7   stars        7912 non-null   object 
 8   votes        6872 non-null   object 
dtypes: float64(1), object(8)
memory usage: 618.1+ KB


#### The `year` column mixes formats such as `(2018– )`, `(2015–2022)`, and `(2022)`, and it has missing values. We will drop it later on.

# Filling certificates with RandomForestClassifier (handling Nulls)

### Let's map `votes` from object dtype to float

In [9]:
df['votes'] = df['votes'].str.replace(',', '').astype(float)


### 1. Handling Missing Values

To prepare the data, we first fill in missing values in the `genre`, `certificate`, and `description` columns. Missing values in categorical columns like `genre` and `certificate` are replaced with "Unknown," while missing descriptions are filled with "No description available." This ensures that all rows have complete data for training the model.

In [10]:
# Fill missing values for 'genre', 'certificate', and 'description'
df['genre'] = df['genre'].fillna('Unknown')
df['certificate'] = df['certificate'].fillna('Unknown')
df['description'] = df['description'].fillna('No description available')

### 2. Feature Engineering: Creating Text Features

We create a new feature, `text_features`, by combining important text-based columns: `title`, `description`, `stars`, and `genre`. This helps to encapsulate all key movie information into one feature, which will be used to train the model. This feature allows the model to learn from multiple aspects of a movie, such as its title, description, cast, and genre.


In [11]:
# Create text features from title, description, stars, and genre
df['text_features'] = df['title'] + ' ' + df['description'] + ' ' + df['stars'] + ' ' + df['genre']
df['text_features'] = df['text_features'].fillna('')

### 3. Encoding Categorical Labels

Since machine learning models require numerical input, we encode the `certificate` column using a `LabelEncoder`. This converts the categorical certificate labels into numerical values, making them usable for training a classification model.

In [12]:
# Convert 'certificate' to string and encode it
le = LabelEncoder()
df['certificate'] = df['certificate'].astype(str)
df['certificate_encoded'] = le.fit_transform(df['certificate'])
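
As a small illustration of what `LabelEncoder` produces, the sketch below encodes a handful of made-up certificate values. It shows that the codes are simply positions in the alphabetically sorted `classes_` array, and that the `"Unknown"` placeholder receives a code like any other label, which is why those rows are filtered out before training.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (hypothetical sample, including the "Unknown" placeholder)
toy_certificates = ["TV-MA", "PG", "Unknown", "TV-14", "TV-MA"]

toy_le = LabelEncoder()
codes = toy_le.fit_transform(toy_certificates)

# classes_ is sorted; each label's code is its index in that array
classes = list(toy_le.classes_)
print(classes)
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
decoded = list(toy_le.inverse_transform(codes))
print(decoded)
```

The integer codes carry no order or distance information, so they are best suited as classification targets rather than as numeric features.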

### 4. Excluding 'Unknown' Certificates for Training

To ensure that the model learns only from valid certificate labels, we exclude rows where the `certificate` is labeled as "Unknown." The model will be trained on movies with known certificates to avoid learning any biases from the placeholder label "Unknown."


In [13]:
# Exclude rows with 'Unknown' certificates for training
df_known = df[df['certificate'] != 'Unknown']
df_unknown = df[df['certificate'] == 'Unknown']

### 5. Vectorizing Text Data

We use `TfidfVectorizer` to convert the `text_features` into numerical vectors that capture the importance of each word in the dataset. This representation of the text features is then used as the input for our model.

In [14]:
# Fit the TfidfVectorizer on the known data
tfidf = TfidfVectorizer(stop_words='english', max_features=30000)
X_known = tfidf.fit_transform(df_known['text_features'])
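
The vectorizer is fitted on the rows with known certificates only and later reused for the unknown rows. A minimal sketch of that fit/transform pattern, with invented example sentences, shows that unseen words are simply ignored at transform time and that both matrices share the same columns:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora (hypothetical text, standing in for text_features)
known_texts = ["karate rivals reunite", "royal family drama", "lawyer crime drama"]
unknown_texts = ["crime family saga"]

toy_tfidf = TfidfVectorizer(stop_words="english")

# fit_transform learns the vocabulary and IDF weights from the known rows
X_known_toy = toy_tfidf.fit_transform(known_texts)

# transform reuses that vocabulary; "saga" was never seen, so it is dropped
X_unknown_toy = toy_tfidf.transform(unknown_texts)

print(X_known_toy.shape, X_unknown_toy.shape)
print("saga" in toy_tfidf.vocabulary_)
```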

### 6. Training the RandomForestClassifier

We train a `RandomForestClassifier` on the vectorized text features of movies with known certificates. The model learns to classify movies based on the textual information present in their titles, descriptions, stars, and genres.

In [15]:
# Target variable is the encoded certificate (excluding 'Unknown')
y_known = df_known['certificate_encoded']

# Split the known data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_known, y_known, test_size=0.3, random_state=42)

# Train the RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
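
The split above is not stratified and the forest uses default class weights. One possible variant, sketched here on synthetic data rather than the movie dataset, passes `stratify=` to `train_test_split` and `class_weight="balanced"` to the classifier so that rare classes are represented in both splits and weighted more heavily. Whether this helps on the certificate labels is untested here. Note also that `stratify` raises an error when a class has fewer than two rows, so very rare certificates would need to be merged or dropped first.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic, imbalanced 3-class problem standing in for the certificate labels
X_toy = rng.normal(size=(600, 10))
y_toy = rng.choice([0, 1, 2], size=600, p=[0.8, 0.15, 0.05])
X_toy[y_toy == 1, 0] += 2.0  # give the minority classes some signal
X_toy[y_toy == 2, 1] += 2.0

# stratify keeps the class proportions the same in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

# class_weight="balanced" reweights samples inversely to class frequency
clf = RandomForestClassifier(n_estimators=100, random_state=42, class_weight="balanced")
clf.fit(X_tr, y_tr)

# Macro-averaged F1 treats every class equally, regardless of its size
macro_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
print(f"macro F1: {macro_f1:.2f}")
```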

### 7. Predicting Missing Certificates

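After training, we first evaluate the model on the held-out test set; the predictions for movies with "Unknown" certificates are made in the next step. The report below shows an overall accuracy of 0.40, with the model assigning the majority class `TV-MA` to most rows (recall 0.98, precision 0.38), so the imputed certificates should be treated as rough guesses.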

In [16]:
# Predict on the test set
y_pred = model.predict(X_test)

# Get the unique labels in the test set (excluding 'Unknown')
unique_labels = np.unique(y_test)

# Print the classification report for known certificates
print(classification_report(y_test, y_pred, labels=unique_labels, target_names=le.classes_[unique_labels]))

              precision    recall  f1-score   support

    Approved       0.00      0.00      0.00        11
           G       0.00      0.00      0.00        12
           M       0.00      0.00      0.00         1
   Not Rated       0.63      0.13      0.22       131
          PG       0.62      0.21      0.31        48
       PG-13       0.73      0.09      0.16        87
      Passed       0.00      0.00      0.00         2
           R       0.67      0.04      0.08       145
       TV-14       0.39      0.10      0.16       247
        TV-G       0.00      0.00      0.00        34
       TV-MA       0.38      0.98      0.55       489
       TV-PG       0.50      0.01      0.02        99
        TV-Y       0.57      0.54      0.56        37
       TV-Y7       0.55      0.12      0.19        52
    TV-Y7-FV       0.00      0.00      0.00         8
     Unrated       0.00      0.00      0.00        19

    accuracy                           0.40      1422
   macro avg       0.31   




### 8. Updating the Dataset

Finally, the dataset is updated with the predicted certificates, replacing the "Unknown" values with the model's predictions. This completes the process of handling missing certificates and preparing the dataset for further analysis or use.

In [17]:
# Now predict the missing certificates for 'Unknown' rows
if not df_unknown.empty:
    X_unknown = tfidf.transform(df_unknown['text_features'])
    predicted_certificates = le.inverse_transform(model.predict(X_unknown))

    # Overwrite the 'Unknown' certificates in the original DataFrame with the predictions.
    # Assigning through df.loc (instead of adding a column to the df_unknown slice)
    # avoids pandas' SettingWithCopyWarning.
    df.loc[df_unknown.index, 'certificate'] = predicted_certificates

# Check how many 'Unknown' certificates remain after the update
print(f"Number of rows with 'Unknown' certificate after update: {df[df['certificate'] == 'Unknown'].shape[0]}")

Number of rows with 'Unknown' certificate after update: 0


In [18]:
df.certificate.unique()

array(['TV-14', 'TV-MA', 'NC-17', 'R', 'PG-13', 'TV-PG', 'PG',
       'Not Rated', 'TV-Y7-FV', 'Approved', 'G', 'TV-Y7', 'Unrated',
       'TV-G', 'MA-17', 'TV-Y', 'Passed', '12', 'M', 'E10+'], dtype=object)

# Data Preprocessing and Regression Model for Missing Ratings and Votes

## Overview

We now handle the remaining missing data by training regression models to predict missing movie ratings and votes. The process involves the following steps.



### 1. Removing the Certificate Encoded Column

We start by removing the previously encoded `certificate_encoded` column as it is no longer needed. This step ensures that we only work with relevant features in the dataset.

---

In [19]:
df.drop('certificate_encoded', inplace=True, axis = 1)

df.info()

backup = df.copy()

<class 'pandas.core.frame.DataFrame'>
Index: 7912 entries, 0 to 9912
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   title          7912 non-null   object 
 1   year           7410 non-null   object 
 2   certificate    7912 non-null   object 
 3   duration       6459 non-null   object 
 4   genre          7912 non-null   object 
 5   rating         6872 non-null   float64
 6   description    7912 non-null   object 
 7   stars          7912 non-null   object 
 8   votes          6872 non-null   float64
 9   text_features  7912 non-null   object 
dtypes: float64(2), object(8)
memory usage: 679.9+ KB


### 2. Multi-Hot Encoding for Genres

The `genre` column contains a comma-separated string of genres for each movie. To encode this data in a way that the model can understand, we:
- Split the `genre` column into a list of individual genres.
- Build a set of all unique genres and apply multi-hot encoding. Each genre gets its own binary column, indicating the presence or absence of that genre for each movie.

---

In [20]:
# Multi-hot encoding for 'genre' column
df['genre'] = df['genre'].str.split(', ')

df['genre'].head()

# Create a set of all unique genres
unique_genres = {genre for sublist in df['genre'] for genre in sublist}

unique_genres

# Add one binary column per genre (1 if the movie has that genre, else 0)
for genre in unique_genres:
    df[genre] = df['genre'].apply(lambda x: int(genre in x))

df.head()

Unnamed: 0,title,year,certificate,duration,genre,rating,description,stars,votes,text_features,...,Adventure,Reality-TV,Music,Thriller,Documentary,Comedy,Crime,Mystery,Romance,Film-Noir
0,Cobra Kai,(2018– ),TV-14,30 min,"[Action, Comedy, Drama]",8.5,Decades after their 1984 All Valley Karate Tou...,"['Ralph Macchio, ', 'William Zabka, ', 'Courtn...",177031.0,Cobra Kai Decades after their 1984 All Valley ...,...,0,0,0,0,0,1,0,0,0,0
1,The Crown,(2016– ),TV-MA,58 min,"[Biography, Drama, History]",8.7,Follows the political rivalries and romance of...,"['Claire Foy, ', 'Olivia Colman, ', 'Imelda St...",199885.0,The Crown Follows the political rivalries and ...,...,0,0,0,0,0,0,0,0,0,0
2,Better Call Saul,(2015–2022),TV-MA,46 min,"[Crime, Drama]",8.9,The trials and tribulations of criminal lawyer...,"['Bob Odenkirk, ', 'Rhea Seehorn, ', 'Jonathan...",501384.0,Better Call Saul The trials and tribulations o...,...,0,0,0,0,0,0,1,0,0,0
3,Devil in Ohio,(2022),TV-MA,356 min,"[Drama, Horror, Mystery]",5.9,When a psychiatrist shelters a mysterious cult...,"['Emily Deschanel, ', 'Sam Jaeger, ', 'Gerardo...",9773.0,Devil in Ohio When a psychiatrist shelters a m...,...,0,0,0,0,0,0,0,1,0,0
4,Cyberpunk: Edgerunners,(2022– ),TV-MA,24 min,"[Animation, Action, Adventure]",8.6,A Street Kid trying to survive in a technology...,"['Zach Aguilar, ', 'Kenichiro Ohashi, ', 'Emi ...",15413.0,Cyberpunk: Edgerunners A Street Kid trying to ...,...,1,0,0,0,0,0,0,0,0,0
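
The same multi-hot encoding can also be written with scikit-learn's `MultiLabelBinarizer`. This is offered as an alternative sketch on a three-row toy frame, not as what the notebook runs. It also makes visible that the `"Unknown"` placeholder becomes a genre column of its own, which is equally true of the loop above.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy frame (hypothetical rows) with comma-separated genre strings
toy = pd.DataFrame({"genre": ["Action, Comedy, Drama", "Crime, Drama", "Unknown"]})

# Split each string into a list of genres
genre_lists = toy["genre"].str.split(", ")

# One binary column per genre; classes_ holds the sorted genre names
mlb = MultiLabelBinarizer()
genre_matrix = pd.DataFrame(
    mlb.fit_transform(genre_lists), columns=mlb.classes_, index=toy.index
)

print(genre_matrix)
```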


### 3. Encoding the Stars Column Using Hashing Vectorizer

The `stars` column holds the string form of a list of names. We parse it and join the names into a single string per movie, then use `HashingVectorizer` to encode that string into a fixed number of dimensions. Hashing maps an open-ended set of names into a 100-dimensional vector without storing a vocabulary, at the cost of occasional collisions between names.

---

In [21]:
import ast

# 'stars' holds the string form of a Python list; parse it safely with
# ast.literal_eval (rather than eval) and join the names into one string per row
df['stars'] = df['stars'].apply(lambda x: ' '.join(ast.literal_eval(x)))

df['stars'].iloc[0]

# Use HashingVectorizer to encode stars into a fixed number of dimensions
vectorizer = HashingVectorizer(n_features=100, alternate_sign=False)

# HashingVectorizer is stateless, so fit_transform here only transforms
stars_encoded = vectorizer.fit_transform(df['stars'])

# Convert to a DataFrame (optional, as HashingVectorizer doesn't have feature names)
stars_encoded_df = pd.DataFrame(stars_encoded.toarray())

df = pd.concat([df.reset_index(drop=True), stars_encoded_df], axis=1)
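
With its default word analyzer, `HashingVectorizer` splits each name into separate word tokens, so two different people who share a first name also share a feature. One possible refinement, sketched below with a hypothetical helper `parse_names`, hashes whole names instead by passing a callable as `analyzer`. The input strings are toy examples modeled on the `stars` values shown earlier, and the stripping rules are assumptions about how the raw cells are formatted.

```python
import ast
from sklearn.feature_extraction.text import HashingVectorizer

def parse_names(cell):
    """Parse a stringified list and return clean, lower-cased full names."""
    items = ast.literal_eval(cell)
    # Strip surrounding spaces, commas, and pipe separators from each entry
    names = [item.strip(" ,|").lower() for item in items]
    # Drop empty entries and label fragments such as 'Stars:'
    return [name for name in names if name and not name.endswith(":")]

# Toy cells (hypothetical), in the same stringified-list format as 'stars'
raw_cells = [
    "['Ralph Macchio, ', 'William Zabka, ']",
    "['Peter Jackson', '| ', '    Stars:', 'Ralph Macchio, ']",
]

# A callable analyzer receives the raw string and returns the tokens to hash
name_hasher = HashingVectorizer(n_features=100, alternate_sign=False, analyzer=parse_names)
hashed = name_hasher.transform(raw_cells)

print(parse_names(raw_cells[1]))
print(hashed.shape)
```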

### 4. Vocabulary Size Check for Titles and Descriptions

Before applying `TfidfVectorizer` to the `title` and `description` columns, we check the vocabulary size (number of unique words) of each. This guides our choice of how many features to keep in the final TF-IDF representation.

---

In [22]:
# Without max_features, fit on full title and description to check vocab size
tfidf_title_full = TfidfVectorizer(stop_words='english')
tfidf_description_full = TfidfVectorizer(stop_words='english')

# Fit to extract the vocabulary
tfidf_title_full.fit(df['title'])
tfidf_description_full.fit(df['description'])

# Check vocabulary sizes
title_vocab_size = len(tfidf_title_full.vocabulary_)
description_vocab_size = len(tfidf_description_full.vocabulary_)

print(f"Number of unique words in titles: {title_vocab_size}")
print(f"Number of unique words in descriptions: {description_vocab_size}")

Number of unique words in titles: 8425
Number of unique words in descriptions: 19567


### 5. Applying TF-IDF to Titles and Descriptions

Based on the vocabulary size, we limited the number of features to:
- 3500 for titles.
- 13000 for descriptions.

With `max_features` set, the vectorizer keeps only the most frequent terms in the corpus, which drops rarely used words and reduces the number of sparse columns.

---


In [23]:

# Apply TfidfVectorizer to 'title' and 'description'
# Note: unlike the vocabulary check above, no stop-word filtering is applied here
tfidf_title = TfidfVectorizer(max_features=3500)  # Limit the number of features
tfidf_description = TfidfVectorizer(max_features=13000)

# Fit and transform the 'title' and 'description'
title_tfidf = tfidf_title.fit_transform(df['title'])
description_tfidf = tfidf_description.fit_transform(df['description'])

### 6. Concatenating Encoded Features and Dropping Original Columns

After encoding the `title` and `description` columns using TF-IDF, we concatenated these results back into the main DataFrame. The original `title` and `description` columns were then dropped, as they had already been encoded.

---

In [24]:

# Convert the TF-IDF results into DataFrames
title_df = pd.DataFrame(title_tfidf.toarray(), columns=tfidf_title.get_feature_names_out())
description_df = pd.DataFrame(description_tfidf.toarray(), columns=tfidf_description.get_feature_names_out())

# Concatenate back to the original DataFrame
df = pd.concat([df.reset_index(drop=True), title_df, description_df], axis=1)


# Drop the original 'title' and 'description' columns since they've been encoded
df = df.drop(columns=['title', 'description'])


### 7. Label Encoding the Certificate Column

The `certificate` column was label encoded to convert categorical values into numerical values. This step allows us to use the certificate data for any future modeling.

---

In [25]:
le = LabelEncoder()
df['certificate_encoded'] = le.fit_transform(df['certificate'])

# Drop the original 'certificate' column
df = df.drop(columns=['certificate'])

df.drop(['text_features','year','duration'], inplace = True, axis = 1)

df.drop(['stars','genre'], axis=1, inplace=True)

### 8. Handling Missing Values for Ratings and Votes

We split the dataset into two parts:
- **Training Data**: Rows where both `rating` and `votes` are present.
- **Missing Data**: Rows where either `rating` or `votes` is missing.

This split allows us to train regression models on the complete data and predict the missing values.

---

In [26]:
train_data = df.dropna(subset=['rating', 'votes'])

# Rows where 'rating' or 'votes' are null (for prediction)
missing_data = df[df['rating'].isnull() | df['votes'].isnull()]

# Separate the features (X) and the targets (y)
X_train = train_data.drop(columns=['rating', 'votes'])
X_train.columns = X_train.columns.astype(str)
y_train_rating = train_data['rating']
y_train_votes = train_data['votes']

# The features for missing rows
X_missing = missing_data.drop(columns=['rating', 'votes'])

### 9. Train-Test Split for Rating and Votes Prediction

For both `rating` and `votes`, we performed a train-test split to train the models on 70% of the data and test their performance on 30%.

---

In [27]:
# Train-test split for rating
X_train_rating, X_test_rating, y_train_rating, y_test_rating = train_test_split(X_train, y_train_rating, test_size=0.3, random_state=42)

# Train-test split for votes
X_train_votes, X_test_votes, y_train_votes, y_test_votes = train_test_split(X_train, y_train_votes, test_size=0.3, random_state=42)
# Convert column names to strings
X_train_rating.columns = X_train_rating.columns.astype(str)
X_test_rating.columns = X_test_rating.columns.astype(str)

In [28]:
# Check for any non-numeric columns and convert or drop them
non_numeric_columns = X_train_rating.select_dtypes(exclude=['number']).columns
print(f"Non-numeric columns: {non_numeric_columns}")

Non-numeric columns: Index([], dtype='object')



### 10. Ridge Regression for Predicting Ratings and Votes

We use Ridge regression to predict both `rating` and `votes`:
- **Rating Model**: Trained on the 70% training split of the rows where both `rating` and `votes` are present.
- **Votes Model**: Trained on the same kind of split, with `votes` as the target.

These models are then used to predict the missing `rating` and `votes` values for the movies where they are unavailable.

---

In [29]:
rating_model = Ridge()
rating_model.fit(X_train_rating, y_train_rating)

votes_model = Ridge()
votes_model.fit(X_train_votes, y_train_votes)

# Identify rows where 'rating' and 'votes' are missing
missing_data_rating = df[df['rating'].isnull()]
missing_data_votes = df[df['votes'].isnull()]

# Features for rows with missing ratings and votes
X_missing_rating = missing_data_rating.drop(columns=['rating', 'votes'])
X_missing_votes = missing_data_votes.drop(columns=['rating', 'votes'])

# Make sure column names are strings for these subsets too
X_missing_rating.columns = X_missing_rating.columns.astype(str)
X_missing_votes.columns = X_missing_votes.columns.astype(str)

predicted_ratings = rating_model.predict(X_missing_rating)
predicted_votes = votes_model.predict(X_missing_votes)
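
The 30% test splits created earlier are never scored in the cells above. A minimal sketch of such a check, run here on synthetic data rather than the movie features, compares Ridge against a baseline that always predicts the training mean:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic regression problem standing in for the rating target
X_toy = rng.normal(size=(500, 20))
true_coef = rng.normal(size=20)
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)

toy_model = Ridge().fit(X_tr, y_tr)
pred = toy_model.predict(X_te)

# Score on the held-out rows and compare with a predict-the-mean baseline
mae = mean_absolute_error(y_te, pred)
r2 = r2_score(y_te, pred)
baseline_mae = mean_absolute_error(y_te, np.full_like(y_te, y_tr.mean()))

print(f"MAE: {mae:.3f}  baseline MAE: {baseline_mae:.3f}  R^2: {r2:.3f}")
```

In the notebook, the equivalent calls would be `rating_model.predict(X_test_rating)` scored against `y_test_rating`, and likewise for votes.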

### 11. Filling Missing Ratings and Votes

Once the models predicted the missing values, we updated the original DataFrame by replacing the missing `rating` and `votes` values with the predictions.

---

In [30]:

# Fill the missing 'rating' values with the predicted ratings
df.loc[df['rating'].isnull(), 'rating'] = predicted_ratings

# Fill the missing 'votes' values with the predicted votes
df.loc[df['votes'].isnull(), 'votes'] = predicted_votes
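
A linear model can predict values outside the valid range, such as a negative vote count. The sketch below clips made-up predictions before they would be written back. The 1 to 10 rating scale is an assumption for illustration; the notebook does not state the scale.

```python
import numpy as np

# Hypothetical raw predictions, including out-of-range values
raw_rating_preds = np.array([11.2, 6.4, -0.3])
raw_votes_preds = np.array([-1500.0, 320.7, 98000.2])

# Assumed valid ranges: ratings in [1, 10], votes as non-negative whole numbers
clipped_ratings = np.clip(raw_rating_preds, 1.0, 10.0)
clipped_votes = np.clip(np.round(raw_votes_preds), 0, None)

print(clipped_ratings)
print(clipped_votes)
```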

# Autoencoder for Movie Embeddings and Similarity-Based Recommendations


Finally, we focus on building an autoencoder to generate movie embeddings and use those embeddings to find the most similar movies in the dataset. The autoencoder compresses the movie features into a lower-dimensional space and then reconstructs them. By capturing key information in the compressed representation (the embeddings), we can compute similarities between movies and recommend the most similar ones.


### 1. Data Preparation

Before training the autoencoder, we first prepare the data:

- **Standardization**: To ensure that all features contribute equally to the model, we apply a `StandardScaler` to normalize the data. This is essential because neural networks, like the autoencoder, perform better when the input features are on the same scale.

---

In [31]:
df.columns = df.columns.astype(str)

# Initialize the scaler
scaler = StandardScaler()

# Fit and transform the DataFrame
scaled_df = scaler.fit_transform(df)
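
As a quick check of what `StandardScaler` does, the sketch below scales a tiny made-up matrix. Each non-constant column ends up with mean 0 and standard deviation 1, and a constant column becomes all zeros. One consequence is that a model which always outputs zeros would score a mean squared error of about 1 on standardized data, which is a useful reference point when reading reconstruction losses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix (hypothetical): rating-like, votes-like, and a constant column
toy = np.array([
    [8.5, 177031.0, 1.0],
    [8.7, 199885.0, 1.0],
    [5.9, 9773.0, 1.0],
])

scaled = StandardScaler().fit_transform(toy)

# Per-column mean and standard deviation after scaling
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)

print(np.round(col_means, 6))
print(np.round(col_stds, 6))
```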

### 2. Autoencoder Architecture

We design an autoencoder that consists of both an encoder and a decoder:

- **Input Layer**: Takes in the standardized movie data.
- **Encoder**:
  - Two dense layers with 64 and 32 units progressively reduce the dimensionality of the input, followed by a 16-unit **embedding layer**. This embedding layer captures the compressed representation of each movie.
  
- **Decoder**:
  - The subsequent layers expand the compressed data back to its original dimensionality (the same number of features as the input).
  
This structure allows the model to learn a compact, useful representation of each movie.

---

In [32]:
# Define the autoencoder model architecture with explicit Input layer
input_layer = layers.Input(shape=(scaled_df.shape[1],))

# Hidden layers (compressing down to 16-dimensional embedding)
x = layers.Dense(64, activation='relu')(input_layer)
x = layers.Dense(32, activation='relu')(x)
embedding_layer = layers.Dense(16, activation='relu')(x)  # Embedding layer

# Expanding back to the original number of features (scaled_df.shape[1])
x = layers.Dense(32, activation='relu')(embedding_layer)
x = layers.Dense(64, activation='relu')(x)
output_layer = layers.Dense(scaled_df.shape[1], activation=None)(x)  # Output layer with the same shape as input



### 3. Model Training

The autoencoder is trained in a self-supervised manner, meaning that the target is the input itself. The goal is for the model to reconstruct the input data as accurately as possible after compressing and decompressing it.

- **Loss Function**: We use Mean Squared Error (`mse`) to measure the difference between the original and reconstructed data.
- **Optimizer**: We use the `adam` optimizer to update the model’s weights during training.
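- **Training**: The model is trained for 15 epochs with a batch size of 64. In the run logged below, the loss stays close to 1.0 throughout, which for standardized inputs is roughly what always predicting the per-feature mean would achieve, so the resulting embeddings should be interpreted with caution.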
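
The objective described in this list can be written compactly as follows. The notation is introduced here for illustration only: $f_\theta$ is the encoder, $g_\phi$ the decoder, $N$ the number of movies, and $d$ the number of standardized input features. The $1/(N d)$ factor reflects averaging the squared error over both samples and features.

```latex
z_i = f_\theta(x_i) \in \mathbb{R}^{16}, \qquad
\hat{x}_i = g_\phi(z_i) \in \mathbb{R}^{d}, \qquad
\mathcal{L}(\theta, \phi) = \frac{1}{N d} \sum_{i=1}^{N} \lVert x_i - \hat{x}_i \rVert_2^2
```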

---


In [33]:
# Define the model
model = Model(inputs=input_layer, outputs=output_layer)

# Compile the model
model.compile(optimizer='adam', loss='mse')

# Train the model (self-supervised, learning embeddings for similarity)
model.fit(scaled_df, scaled_df, epochs=15, batch_size=64)

Epoch 1/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 11ms/step - loss: 1.0013
Epoch 2/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - loss: 1.0040
Epoch 3/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - loss: 0.9924
Epoch 4/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 13ms/step - loss: 1.0017
Epoch 5/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 13ms/step - loss: 0.9947
Epoch 6/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 12ms/step - loss: 0.9993
Epoch 7/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - loss: 0.9946
Epoch 8/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 11ms/step - loss: 1.0036
Epoch 9/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step - loss: 0.9971
Epoch 10/15
[1m124/124[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 13ms

<keras.src.callbacks.history.History at 0x2827bc740>

### 4. Generating Movie Embeddings

Once the autoencoder is trained, we extract the compressed 16-dimensional embeddings from the **middle layer** of the autoencoder. These embeddings serve as the key feature representations for each movie and capture important information about the movie's characteristics in a much smaller space.

---

In [34]:
# Create a new model that outputs the embeddings from the middle layer
embedding_model = Model(inputs=model.input, outputs=model.layers[3].output)  # Layer 3 is the embedding layer

# Generate embeddings for all movies in the dataset
movie_embeddings = embedding_model.predict(scaled_df)

248/248 ━━━━━━━━━━━━━━━━━━━━ 0s 1ms/step


### 5. Computing Movie Similarities

To compute the similarity between movies, we use **cosine similarity** on the embeddings. Cosine similarity measures how close two vectors are in the embedding space, providing a way to quantify the similarity between movies based on their compressed representations.

---

In [35]:
# Compute the cosine similarity between the embeddings
similarity_matrix = cosine_similarity(movie_embeddings)

# Function to get the top 10 most similar movies
def get_similar_movies(movie_index, similarity_matrix, top_n=10):
    # Get the similarities for the target movie
    similarities = similarity_matrix[movie_index]

    # Sort movies by similarity, highest first
    ranked = np.argsort(similarities)[::-1]

    # Drop the movie itself explicitly; it is not guaranteed to be ranked first
    # (for example, when several movies have identical embeddings)
    ranked = ranked[ranked != movie_index]

    return ranked[:top_n]
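
Cosine similarity is the dot product of two vectors after each has been scaled to unit length. The sketch below computes it by hand with NumPy on four toy 2-D embeddings and ranks neighbours for the first one. The helper names `cosine_similarity_matrix` and `top_n_similar` are hypothetical, and the `eps` guard against all-zero vectors is an added assumption (a ReLU embedding layer can output all zeros).

```python
import numpy as np

def cosine_similarity_matrix(embeddings, eps=1e-12):
    """Pairwise cosine similarity; eps avoids division by zero for zero vectors."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, eps)
    return unit @ unit.T

def top_n_similar(index, sim, top_n=10):
    """Indices of the top_n most similar rows, excluding the row itself."""
    scores = sim[index].copy()
    scores[index] = -np.inf  # make sure the item never recommends itself
    return np.argsort(scores)[::-1][:top_n]

# Toy embeddings (hypothetical)
toy_embeddings = np.array([
    [1.0, 0.0],
    [2.0, 0.0],  # same direction as row 0, different length
    [1.0, 1.0],  # 45 degrees away from row 0
    [0.0, 3.0],  # orthogonal to row 0
])

sim = cosine_similarity_matrix(toy_embeddings)
neighbours = top_n_similar(0, sim, top_n=3)
print(neighbours)  # [1 2 3]
```

Row 1 ranks first even though it is twice as long as row 0, because only direction matters.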

### 6. Movie Recommendations

Using the similarity matrix, we can identify the top 10 most similar movies for any given movie in the dataset. By sorting the similarity scores in descending order (excluding the movie itself), we retrieve a list of the most similar movies.

- **Top N Similar Movies**: For a given movie, we output the top 10 most similar movies based on the embeddings. This allows us to generate meaningful recommendations by leveraging the learned features from the autoencoder.

---

In [36]:
# Get the top 10 most similar movies for the movie at index 20
similar_movies_indices = get_similar_movies(20, similarity_matrix, top_n=10)

print(f"Most similar to {backup.title.iloc[20]}")
print(backup.iloc[20])
for i in similar_movies_indices:
  
    print('#----------')
    print(backup.iloc[i])

Most similar to The Lord of the Rings: The Fellowship of the Ring
title            The Lord of the Rings: The Fellowship of the Ring
year                                                        (2001)
certificate                                                  PG-13
duration                                                   178 min
genre                                     Action, Adventure, Drama
rating                                                         8.8
description      A meek Hobbit from the Shire and eight compani...
stars            ['Peter Jackson', '| ', '    Stars:', 'Elijah ...
votes                                                    1844055.0
text_features    The Lord of the Rings: The Fellowship of the R...
Name: 20, dtype: object
#----------
title                                                     #BlackAF
year                                                        (2020)
certificate                                                  TV-MA
duration                   