In [None]:
import pandas as pd
import numpy as np  
import os


# Autoencoder for Collaborative Filtering (PyTorch)

**NOTE**: The command below installs PyTorch with support for our specific GPU setup. It needs to be adapted for a different environment.

- Cuda compilation tools, release 12.5, V12.5.40
- Build cuda_12.5.r12.5/compiler.34177558_0

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121


### Autoencoders

Autoencoders are neural networks that are trained to reconstruct their input data. They are especially powerful for tasks such as dimensionality reduction, anomaly detection, and recommendation systems. 

In the context of collaborative filtering, Autoencoders can:
- Learn a **low-dimensional representation** of users and items (hotels and channels) in the hidden layers.
- Reconstruct the interactions between users (hotels) and items (channels), allowing us to predict the interactions that have not been seen before.
- Provide recommendations based on these reconstructions.

Autoencoders can capture **non-linear patterns** in the data, which simpler linear models such as SVD may miss.

### How Autoencoders Work

An Autoencoder is composed of two main components:
1. **Encoder**: This part compresses the input data into a lower-dimensional space (also known as the bottleneck layer or latent space).
2. **Decoder**: This part reconstructs the input data from the compressed representation produced by the encoder.

The architecture can be visualized as follows:

$$
\text{Input Data} \xrightarrow{\text{Encoder}} \text{Latent Representation} \xrightarrow{\text{Decoder}} \text{Reconstructed Data}
$$

The goal of training an Autoencoder is to minimize the difference between the **input data** and the **reconstructed data**. This is achieved through **backpropagation** and an optimization process (e.g., Adam optimizer) using a loss function (e.g., Mean Squared Error).
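
As a rough illustration of the encoder, latent representation, and decoder flow described above, the following minimal sketch pushes a toy batch through two small `nn.Sequential` stacks and runs one Adam step on the MSE reconstruction loss. The sizes (10 inputs, 3 latent units) and the names `toy_encoder` and `toy_decoder` are assumptions chosen for illustration; this is not the model used later in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy batch: 4 samples with 10 binary features each (illustrative sizes)
x = torch.randint(0, 2, (4, 10)).float()

toy_encoder = nn.Sequential(nn.Linear(10, 3), nn.ReLU())     # 10 -> 3 (bottleneck)
toy_decoder = nn.Sequential(nn.Linear(3, 10), nn.Sigmoid())  # 3 -> 10 (reconstruction)

params = list(toy_encoder.parameters()) + list(toy_decoder.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)
criterion = nn.MSELoss()

latent = toy_encoder(x)
reconstructed = toy_decoder(latent)
loss = criterion(reconstructed, x)

# One optimization step: backpropagate the reconstruction error, update weights
optimizer.zero_grad()
loss.backward()
optimizer.step()

print(latent.shape, reconstructed.shape)
```

The latent tensor has fewer columns than the input, while the reconstruction has the same shape as the input, which is what allows the loss to compare the two element by element.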

### Mathematical Formulation

Let the input data matrix be $ X \in \mathbb{R}^{m \times n} $, where $ m $ is the number of hotels (users) and $ n $ is the number of channels (items). The objective of the Autoencoder is to learn an encoding function $ f: \mathbb{R}^n \rightarrow \mathbb{R}^k $ that maps the input data to a lower-dimensional space of size $ k $, and a decoding function $ g: \mathbb{R}^k \rightarrow \mathbb{R}^n $ that maps the latent representation back to the original space.

The Autoencoder is trained by minimizing the **reconstruction error**:

$$
\mathcal{L}(X, \hat{X}) = \frac{1}{m} \sum_{i=1}^{m} \| X_i - \hat{X}_i \|_2^2
$$

where:
- $ X_i $ is the original interaction vector for hotel $ i $,
- $ \hat{X}_i $ is the reconstructed interaction vector for hotel $ i $,
- $ \|\cdot\|_2 $ denotes the Euclidean norm.

This loss function is optimized to minimize the difference between the original and reconstructed data, ensuring that the model captures the patterns in the data.
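
To make the formula concrete, the sketch below evaluates it on a small made-up pair of matrices and compares it with PyTorch's `nn.MSELoss`. With its default `reduction='mean'`, `nn.MSELoss` averages over all $m \times n$ entries, so it equals the expression above divided by $n$; the two differ only by a constant factor and share the same minimizer. The tensors `X` and `X_hat` hold illustrative values, not data from this project.

```python
import torch
import torch.nn as nn

# Toy data: m = 2 hotels, n = 3 channels (illustrative values only)
X = torch.tensor([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
X_hat = torch.tensor([[0.9, 0.2, 0.7],
                      [0.1, 0.6, 0.3]])

m, n = X.shape

# Formula from the text: mean over hotels of the squared Euclidean norm per row
per_hotel_sq_norm = ((X - X_hat) ** 2).sum(dim=1)
loss_formula = per_hotel_sq_norm.mean()

# PyTorch default: mean over all m * n entries
loss_torch = nn.MSELoss()(X_hat, X)

print(loss_formula.item(), loss_torch.item() * n)  # the two values agree
```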

### Model Architecture

### Encoder

The **encoder** is a neural network that compresses the input data into a latent representation. In our case, the input is a binary (multi-hot) vector representing the interactions of a hotel with all available channels: it contains a 1 for every channel the hotel uses, so it can have several non-zero entries. The encoder has two layers:
- A fully connected layer that reduces the dimensionality from the input size $ n $ (number of channels) to a smaller hidden dimension.
- Another fully connected layer that reduces the representation further to half of the previous hidden size, providing a compact latent space.

Mathematically, for an input vector $ \mathbf{x} $, the encoder function is defined as:

$$
\mathbf{h} = f(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)
$$

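where $ \mathbf{h} $ is the latent representation, $ \mathbf{W}_1 $ and $ \mathbf{b}_1 $ are the weights and bias of the layer, and $ f $ is an activation function (ReLU). For brevity, the equation shows a single layer; in the model described above, two such layers are applied in sequence, each followed by ReLU, and the output of the second one is the latent representation.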

### Decoder

The **decoder** reconstructs the input data from the latent representation. It consists of two fully connected layers:
- The first layer expands the latent space back to a larger dimension.
- The second layer reconstructs the original data.

Mathematically, the decoder function is defined as:

$$
\hat{\mathbf{x}} = g(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)
$$

where $ \hat{\mathbf{x}} $ is the reconstructed data, $ \mathbf{W}_2 $ and $ \mathbf{b}_2 $ are the weights and bias of the decoder layer, and $ g $ is a Sigmoid activation function that keeps the output between 0 and 1 (appropriate for binary data, like hotel-channel interactions). As with the encoder, the equation shows a single layer for brevity; in the model, the first decoder layer uses ReLU and only the final layer uses Sigmoid.
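
The two equations can be checked numerically. The sketch below, a simplified single-layer version under assumed sizes (`n = 6` channels, `k = 2` latent units), computes $ \mathbf{h} $ and $ \hat{\mathbf{x}} $ by hand from the weights of two `nn.Linear` layers and compares the result with calling the layers directly. The names `enc` and `dec` and the input vector are illustrative only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n, k = 6, 2  # illustrative sizes: n channels, k latent units
x = torch.tensor([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])  # toy binary interaction vector

enc = nn.Linear(n, k)  # holds W_1 (k x n) and b_1 (k)
dec = nn.Linear(k, n)  # holds W_2 (n x k) and b_2 (n)

with torch.no_grad():
    # h = f(W_1 x + b_1) with f = ReLU
    h_manual = torch.relu(enc.weight @ x + enc.bias)
    # x_hat = g(W_2 h + b_2) with g = Sigmoid
    x_hat_manual = torch.sigmoid(dec.weight @ h_manual + dec.bias)

    # The same computation through the layer modules
    h = torch.relu(enc(x))
    x_hat = torch.sigmoid(dec(h))

print(h.shape, x_hat.shape)
```

ReLU keeps every latent activation non-negative, and Sigmoid maps every reconstructed entry strictly between 0 and 1, so the output can be read as a score per channel.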

### Training

The model is trained on the hotel-channel interaction matrix, which consists of binary values: 1 if a hotel has interacted with a channel, and 0 if not. A 0 may mean either that the channel is unsuitable or simply that the hotel has not tried it yet, so channels that receive a high reconstructed score despite a 0 in the input are treated as candidate recommendations.
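
As a small illustration of this input format, the sketch below builds a binary hotel-by-channel matrix from long-form pairs with `pd.crosstab`. The column names follow the dataset used in this notebook, but the IDs are made up, and `pd.crosstab` is only one possible way to build the matrix (the notebook itself fills a NumPy array).

```python
import pandas as pd

# Made-up long-form pairs; column names follow the dataset used in this notebook
pairs = pd.DataFrame({
    "Hotel_ID": [101, 101, 102, 103, 103, 103],
    "Canal_ID": [1, 2, 2, 1, 3, 3],  # note the duplicated (103, 3) pair
})

# Count co-occurrences, then clip to 0/1 so duplicates do not exceed 1
toy_matrix = pd.crosstab(pairs["Hotel_ID"], pairs["Canal_ID"]).clip(upper=1)

print(toy_matrix)
```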

The training objective is to minimize the **Mean Squared Error** (MSE) between the original input and the reconstructed data. Once trained, the model can be used to predict the interaction matrix for hotels and generate recommendations based on the predicted values.
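
One way to turn the predicted values into recommendations is sketched below: existing interactions are masked out first, and the `k` highest remaining scores are kept per hotel. The helper `top_k_new` and the score values are hypothetical and serve only as an illustration.

```python
import numpy as np

def top_k_new(scores, seen, k):
    """Return indices of the k highest-scoring columns per row, skipping seen ones.

    Hypothetical helper for illustration. `scores` and `seen` are
    (rows x columns) arrays, with `seen` holding 0/1 interactions.
    Assumes every row has at least k unseen columns.
    """
    masked = np.where(seen > 0, -np.inf, scores)  # exclude existing interactions
    order = np.argsort(-masked, axis=1)           # descending by score
    return order[:, :k]

# Made-up reconstruction scores for 2 hotels and 4 channels
scores = np.array([[0.9, 0.2, 0.6, 0.4],
                   [0.1, 0.8, 0.3, 0.7]])
seen = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0]])

recs = top_k_new(scores, seen, k=2)
print(recs)
```

Masking before selecting the top `k` yields `k` new candidates per hotel, whereas selecting the top `k` first and filtering afterwards can leave fewer than `k`.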

In [None]:
# Importing the data 

# Information about individual channels
data_lake_prd_314410_cz_canais = pd.read_csv('../data/lookups/data-lake-prd-314410.cz.canais.csv')

# List of hotel-channel combinations as of January 2025
hotel_city_chanel_combin_extract = pd.read_csv('../data/other/hotel_city_chanel_combin_extract.csv')
hotel_city_chanel_combin_extract.dropna(inplace=True)
hotel_city_chanel_combin_extract.drop(columns=['Cidade_ID'], inplace=True)
hotel_city_chanel_combin_extract.drop_duplicates(inplace=True)

In [None]:
import torch

print(torch.__version__)  # Check PyTorch version
print(torch.cuda.is_available())  # Should return True if GPU is detected
if torch.cuda.is_available():  # get_device_name raises an error when no GPU is present
    print(torch.cuda.get_device_name(0))  # Should print your GPU model

# Output on our machine:
# 2.5.1+cu121
# True
# NVIDIA RTX A2000 Laptop GPU

In [None]:
import pandas as pd
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim

from sklearn.model_selection import train_test_split
from torch.utils.data import DataLoader, TensorDataset

In [None]:
# Copy the dataset (long form list of hotels-channels combinations) to a new DataFrame
df = hotel_city_chanel_combin_extract.copy()

# Map Hotel_IDs and Channel_IDs to integer indices
hotel_ids = df["Hotel_ID"].unique()
channel_ids = df["Canal_ID"].unique()

hotel_to_idx = {hotel: i for i, hotel in enumerate(hotel_ids)}
channel_to_idx = {channel: i for i, channel in enumerate(channel_ids)}

In [None]:
# Create interaction matrix (hotels × channels)
num_hotels = len(hotel_ids)
num_channels = len(channel_ids)
interaction_matrix = np.zeros((num_hotels, num_channels)) # hotel rows x channel columns

In [None]:
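# Fill the interaction matrix: 1 where a hotel is connected to a channel
# (vectorized equivalent of looping over the rows of df)
h_idx = df["Hotel_ID"].map(hotel_to_idx).to_numpy()
c_idx = df["Canal_ID"].map(channel_to_idx).to_numpy()
interaction_matrix[h_idx, c_idx] = 1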

In [None]:
# Split into train & test
train_data, test_data = train_test_split(interaction_matrix, test_size=0.2, random_state=42)    

# Convert to PyTorch tensors
train_tensor = torch.FloatTensor(train_data)
test_tensor = torch.FloatTensor(test_data)

train_loader = DataLoader(TensorDataset(train_tensor), batch_size=64, shuffle=True)
test_loader = DataLoader(TensorDataset(test_tensor), batch_size=64, shuffle=False)  # No shuffling needed for evaluation

In [None]:
# Define the Autoencoder

class Autoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim=128):
        
        super().__init__()
        
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.ReLU(),
        )

        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim // 2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid() # Outputs between 0 and 1
        )
    
    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded


In [None]:
# initialize the model
input_dim = num_channels # Each row is a hotel, so input_dim = number of channels ("columns")
autoencoder = Autoencoder(input_dim)

In [None]:
autoencoder

'''
Autoencoder(
  (encoder): Sequential(
    (0): Linear(in_features=732, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=64, bias=True)
    (3): ReLU()
  )
  (decoder): Sequential(
    (0): Linear(in_features=64, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=732, bias=True)
    (3): Sigmoid()
  )
)
'''

In [None]:
# Loss function & optimizer
criterion = nn.MSELoss()
optimizer = optim.Adam(autoencoder.parameters(), lr=0.001)

In [None]:
# Train the Autoencoder
epochs = 200
for epoch in range(epochs):
    autoencoder.train()
    total_loss = 0

    for batch in train_loader:
        optimizer.zero_grad()
        inputs = batch[0]  # Extract input tensor
        outputs = autoencoder(inputs)  # Forward pass
        loss = criterion(outputs, inputs)  # Compute loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(train_loader)  # Mean loss per batch, comparable across batch counts
    print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

In [None]:
# Generate recommendations
autoencoder.eval()
hotel_recommendations = {}

with torch.no_grad():
    reconstructed = autoencoder(torch.FloatTensor(interaction_matrix)).numpy()

In [None]:
for hotel_idx, scores in enumerate(reconstructed):
    sorted_channels = np.argsort(scores)[::-1]  # Sort channels by predicted score (descending)
    recommended_channels = [channel_ids[i] for i in sorted_channels[:50]]  # Top 50 channels
    hotel_recommendations[hotel_ids[hotel_idx]] = recommended_channels

In [None]:
hotel_recommendations

In [None]:

# Show recommendations for a few hotels
for hotel, recs in list(hotel_recommendations.items())[:5]:
    print(f"Hotel {hotel} -> Recommended Channels: {recs}")

In [None]:
hotel_recommendations.get(16573.0, [])


In [None]:
# Convert dictionary to DataFrame
df_recommendations = pd.DataFrame(
    [(hotel, channel) for hotel, channels in hotel_recommendations.items() for channel in channels],
    columns=['Hotel_ID', 'Channel_ID']
)


In [None]:
df_recommendations

In [None]:
df_recommendations["Hotel_ID"] = df_recommendations["Hotel_ID"].astype(int)


In [None]:
existing_channels = hotel_city_chanel_combin_extract.groupby("Hotel_ID")["Canal_ID"].apply(set).to_dict()

In [None]:
# Filter out already existing channels
filtered_recommendations = {
    hotel: [channel for channel in channels if channel not in existing_channels.get(hotel, set())]
    for hotel, channels in hotel_recommendations.items()
}

In [None]:
# Convert to DataFrame
df_filtered_recommendations = pd.DataFrame([
    (hotel, channel) for hotel, channels in filtered_recommendations.items() for channel in channels
], columns=["Hotel_ID", "Recommended_Channel"])

df_filtered_recommendations

In [None]:
df_filtered_recommendations.to_csv("../out/autoencoder_hotel_channel_recommendations_top50.csv", index=False)


In [None]:
# Ensure model is in evaluation mode
autoencoder.eval()

# Pass the full dataset through the encoder to get latent features
with torch.no_grad():
    encoded_hotels = autoencoder.encoder(torch.FloatTensor(interaction_matrix)).numpy()  # Extract the encoded representation


In [None]:
encoded_hotels

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Compute similarity between hotels based on their latent representations
similarity_matrix = cosine_similarity(encoded_hotels)

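# Get the most similar hotels for a given hotel
hotel_id = 11332  # Example hotel

if hotel_id in hotel_to_idx:  # Ensure the hotel exists in our mapping
    query_idx = hotel_to_idx[hotel_id]
    # Rank all hotels by similarity (descending) and drop the query hotel itself
    ranked = similarity_matrix[query_idx].argsort()[::-1]
    similar_idx = [i for i in ranked if i != query_idx][:10]
    similar_hotels = hotel_ids[similar_idx]  # Map matrix indices back to Hotel_IDs
    print("Most similar hotels:", similar_hotels)
else:
    print(f"Hotel ID {hotel_id} not found in hotel_to_idx.")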

In [None]:

# Get the channel IDs for each hotel
channels_hotel_11332 = set(df_filtered_recommendations[df_filtered_recommendations["Hotel_ID"] == 11332]["Recommended_Channel"])
channels_hotel_13558 = set(df_filtered_recommendations[df_filtered_recommendations["Hotel_ID"] == 13558]["Recommended_Channel"])

# Compute the intersection of the two sets (overlap)
overlap_channels = channels_hotel_11332.intersection(channels_hotel_13558)

# Output the results
print(f"Channels for Hotel 11332: {channels_hotel_11332}")
print(f"Channels for Hotel 13558: {channels_hotel_13558}")
print(f"Overlap Channels: {overlap_channels}")

**NOTE:** Autoencoders compress the hotel-channel interaction matrix into a lower-dimensional latent space (encoder), reconstruct the interaction matrix from that latent representation (decoder), and are trained to minimize the reconstruction error.
This means that Autoencoders learn to "rebuild" the original matrix rather than directly capturing high-variance structures (as SVD does).

- In SVD, hotels with similar interaction patterns tend to get similar recommendations, because the singular vectors capture global patterns in the data. 
- In Autoencoders, the learned latent space might not be structured in the same way, so similar hotels might have very different activations in the bottleneck layer.
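
For comparison, the SVD approach mentioned above can be sketched with a rank-`k` truncated SVD in NumPy. The matrix `X` below is made up, and `k = 2` is an arbitrary illustrative choice; this is a minimal sketch of the general technique, not the SVD pipeline used elsewhere in this project.

```python
import numpy as np

# Made-up binary interaction matrix: 4 hotels x 5 channels
X = np.array([[1, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 1]], dtype=float)

k = 2  # number of singular vectors kept (illustrative choice)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation

# Low-dimensional hotel coordinates, comparable to the encoder output
hotel_factors = U[:, :k] * s[:k]

print(np.round(X_hat, 2))
```

Because the reconstruction is a linear projection onto the leading singular directions, hotels with similar rows in `X` receive similar coordinates in `hotel_factors` and therefore similar reconstructed scores.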

As such, we prefer to employ methodologies that capture high-variance structures.