In [None]:
import pandas as pd
import numpy as np  

from sklearn.preprocessing import StandardScaler
from sklearn.metrics.pairwise import cosine_similarity

from surprise import SVD, Dataset, Reader
from surprise.model_selection import train_test_split

from collections import defaultdict
import os
pd.set_option('display.float_format', lambda x: '%.0f' % x)

In [None]:
# Information about individual channels
data_lake_prd_314410_cz_canais = pd.read_csv('../data/lookups/data-lake-prd-314410.cz.canais.csv')

In [None]:
# List of hotel-channel combinations as of January 2025
hotel_city_chanel_combin_extract  = pd.read_csv('../data/other/hotel_city_chanel_combin_extract.csv')
hotel_city_chanel_combin_extract.dropna(inplace=True)
hotel_city_chanel_combin_extract.drop(columns=['Cidade_ID'], inplace=True)
hotel_city_chanel_combin_extract.drop_duplicates(inplace=True)

In [None]:
hotel_city_chanel_combin_extract['Hotel_ID'].max()

# Singular Value Decomposition (SVD) 

In [None]:
# Pivot the table
pivot_table = hotel_city_chanel_combin_extract.pivot_table(index='Hotel_ID', columns='Canal_ID', aggfunc='size', fill_value=0)
# Convert the table to binary (1 where the combination existed, 0 otherwise)
pivot_table = pivot_table.map(lambda x: 1 if x > 0 else 0)


In [None]:
pivot_table

In [None]:
# Count the number of 1s for each column
counts_per_channel = pivot_table.sum().sort_values(ascending=False)

### Singular Value Decomposition (SVD) 

In collaborative filtering and recommendation systems, one of the most widely used techniques for matrix factorization is **Singular Value Decomposition (SVD)**. 
The purpose of using SVD is to decompose a large, sparse matrix (such as a user-item or hotel-channel matrix) into smaller, dense matrices that represent the latent factors underlying the data.

For example, in a hotel-channel recommendation system:
- The rows represent hotels (users).
- The columns represent channels (items).
- The entries represent whether a hotel uses a particular channel or not (binary, 0 or 1), or how much they use the channel (ratings).

SVD allows us to reduce the dimensionality of this matrix, uncover hidden relationships (latent factors) between hotels and channels, and predict missing values (channels a hotel might be interested in). By doing so, SVD helps generate accurate recommendations based on user (hotel) preferences and item (channel) characteristics, even for unknown combinations.

### Conceptual Explanation

SVD works by decomposing the original matrix `R` (of shape `m x n` where `m` is the number of users (hotels) and `n` is the number of items (channels)) into three smaller matrices:

$$
R = U \Sigma V^T
$$

Where:
- $R$ is the original matrix of size $m \times n $.
- $ U $ is an $ m \times k $ matrix (left singular vectors or user factors).
- $ \Sigma $ is a $ k \times k $ diagonal matrix (singular values, capturing the strength of the latent factors).
- $ V^T $ is a $ k \times n $ matrix (right singular vectors or item factors).
- $ k $ is the number of latent factors, typically much smaller than `m` and `n`. This notebook uses 100 latent factors.
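
The shapes listed above can be checked on a small example. The sketch below is illustrative only: the 4 x 5 toy matrix and the choice `k = 2` are assumptions made for the example, and it uses NumPy's exact SVD rather than the Surprise model trained later in this notebook.

```python
import numpy as np

# Toy binary hotel-channel matrix: 4 hotels (rows) x 5 channels (columns).
# The values are made up for illustration.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

k = 2  # number of latent factors to keep (illustrative choice)

# Thin SVD: singular values in s come back sorted in descending order.
U_full, s, Vt_full = np.linalg.svd(R, full_matrices=False)

# Keep only the k strongest latent factors.
U = U_full[:, :k]        # m x k  (hotel factors)
Sigma = np.diag(s[:k])   # k x k  (singular values on the diagonal)
Vt = Vt_full[:k, :]      # k x n  (channel factors)

R_k = U @ Sigma @ Vt     # rank-k approximation of R, m x n

print(U.shape, Sigma.shape, Vt.shape, R_k.shape)
# (4, 2) (2, 2) (2, 5) (4, 5)
```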

### What does this decomposition represent?

1. **U (User Features Matrix)**: Each row of matrix $ U $ represents a hotel (user) in the latent factor space, capturing the latent characteristics that describe each hotel’s preferences or behaviors.

2. **Σ (Singular Values)**: These values capture the importance of each latent factor in the decomposition. They are sorted in descending order, and larger values indicate that the corresponding latent factors are more significant.

3. **V^T (Item Features Matrix)**: Each column of matrix $ V^T $ represents a channel (item) in the latent factor space, capturing the latent characteristics that describe each channel’s attributes or features.

### How Does SVD Help in Recommendations?

SVD allows us to predict missing values in the matrix (e.g., how strongly a hotel is expected to match a channel it has never used). Once the decomposition is done, the predicted rating for a hotel-channel combination can be calculated as:

$$
\hat{R}_{ij} = U_i \Sigma V_j^T
$$

Where:
- $ \hat{R}_{ij} $ is the predicted value for the hotel $ i $ and the channel $ j $.
- $ U_i $ is the row vector corresponding to hotel $ i $ in $ U $ (user factors).
- $ \Sigma $ is the diagonal matrix of singular values.
- $ V_j^T $ is the column vector corresponding to channel $ j $ in $ V^T $ (item factors).

By multiplying these components together, we obtain an estimate of the rating (or preference) that a hotel would give to a channel.
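
As a rough illustration of the formula above, the sketch below computes one predicted entry from the factor matrices and checks that it matches the same entry of the full reconstruction. The toy matrix, `k = 2`, and the chosen indices `i` and `j` are assumptions for the example.

```python
import numpy as np

# Toy binary hotel-channel matrix (made-up values).
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

k = 2
U_full, s, Vt_full = np.linalg.svd(R, full_matrices=False)
U, Sigma, Vt = U_full[:, :k], np.diag(s[:k]), Vt_full[:k, :]

i, j = 1, 4  # hotel 1 does not use channel 4 in R

# Predicted value for one pair: row i of U, times Sigma, times column j of V^T.
score_ij = U[i, :] @ Sigma @ Vt[:, j]

# The same entry read from the full rank-k reconstruction.
full_ij = (U @ Sigma @ Vt)[i, j]

print(f"predicted score for hotel {i}, channel {j}: {score_ij:.4f}")
print("matches full reconstruction:", bool(np.isclose(score_ij, full_ij)))
```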

### Mathematical Explanation

### SVD in Matrix Form

Given a matrix $ R $ (of size $ m \times n $):

$$
R = \begin{pmatrix} 
r_{11} & r_{12} & \dots & r_{1n} \\
r_{21} & r_{22} & \dots & r_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
r_{m1} & r_{m2} & \dots & r_{mn} \\
\end{pmatrix}
$$

The goal is to decompose $ R $ into three matrices $ U $, $ \Sigma $, and $ V^T $ such that:

$$
R = U \Sigma V^T
$$

- **$ U $**: Matrix of size $ m \times k $ containing the left singular vectors (user factors).
- **$ \Sigma $**: Diagonal matrix of size $ k \times k $ containing the singular values (strength of latent factors).
- **$ V^T $**: Matrix of size $ k \times n $ containing the right singular vectors (item factors).

The decomposition aims to approximate the original matrix $ R $ by the product of $ U $, $ \Sigma $, and $ V^T $. The latent factors in $ U $ and $ V $ capture the underlying structure and relationships between users and items (or hotels and channels), even if the original data matrix is sparse.
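
To make the idea of approximation concrete, the sketch below reconstructs a toy matrix with an increasing number of latent factors and reports the reconstruction error. For an exact SVD, the Frobenius norm of the residual equals the square root of the sum of the squared discarded singular values, which the loop verifies numerically. The toy matrix is an assumption for the example.

```python
import numpy as np

# Toy binary hotel-channel matrix (made-up values).
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(R, full_matrices=False)

errors = []
for k in range(1, len(s) + 1):
    R_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.linalg.norm(R - R_k)        # Frobenius norm of the residual
    tail = np.sqrt(np.sum(s[k:] ** 2))   # energy in the discarded factors
    errors.append((k, err, tail))
    print(f"k={k}: error={err:.4f}, discarded={tail:.4f}")
```

The error can only shrink as `k` grows, which is why a small `k` trades some accuracy for a much more compact representation.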

### Practical Aspects of SVD

1. **Dimensionality Reduction**: The number of latent factors $ k $ is typically much smaller than the number of rows and columns in the original matrix. This reduction in dimensions helps in making the computations more efficient and reveals patterns in the data that might otherwise be difficult to detect.

2. **Regularization**: In practice, regularization is often applied to prevent overfitting, particularly when dealing with sparse matrices. Regularization terms are added to the loss function to penalize large values in the latent factors.

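3. **Optimization Method**: A classical SVD is computed directly with linear algebra routines, but it requires a fully observed matrix. Recommender-style factorizations are instead fitted iteratively, for example with **stochastic gradient descent (SGD)** or **Alternating Least Squares (ALS)**, where the latent factors are updated alternately for users and items to minimize the reconstruction error. The `SVD` class from the Surprise library used below is trained with SGD and learns user and item factors plus bias terms rather than an explicit $ \Sigma $ matrix.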
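
As a sketch of what regularization means here, a commonly used objective for this kind of factorization is shown below. The symbols $ p_i $ (latent vector of hotel $ i $), $ q_j $ (latent vector of channel $ j $), $ \lambda $ (regularization strength), and $ \mathcal{K} $ (the set of observed hotel-channel pairs) are introduced only for this illustration and do not appear elsewhere in the notebook.

$$
\min_{p, q} \sum_{(i, j) \in \mathcal{K}} \left( R_{ij} - p_i^T q_j \right)^2 + \lambda \left( \lVert p_i \rVert^2 + \lVert q_j \rVert^2 \right)
$$

The first term is the reconstruction error on known entries; the second penalizes large latent factors, which is the role the `reg_all` parameter appears to play in the Surprise model.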

### Steps in Training an SVD Model

1. **Matrix Decomposition**: The matrix $ R $ is decomposed into $ U $, $ \Sigma $, and $ V^T $.
2. **Model Fitting**: The SVD model learns the latent factors that best represent the preferences or interactions between users and items.
3. **Prediction**: For each hotel (user) and channel (item), the model predicts the rating (or preference) using the learned latent factors.
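
The fitting step can be sketched as a few lines of stochastic gradient descent. This is a simplified illustration, not the Surprise implementation: it omits the bias terms, and the function names `fit_mf_sgd` and `mse`, the toy matrix, and all hyperparameter values are assumptions chosen for the example.

```python
import numpy as np

def fit_mf_sgd(ratings, n_users, n_items, n_factors=2,
               n_epochs=200, lr=0.05, reg=0.02, seed=0):
    """Fit user factors P and item factors Q with plain SGD (no bias terms)."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, size=(n_users, n_factors))
    Q = rng.normal(0.0, 0.1, size=(n_items, n_factors))
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]          # prediction error for this cell
            p_u = P[u].copy()              # keep the old user vector for Q's update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

def mse(ratings, P, Q):
    """Mean squared error of the factor model on the given ratings."""
    return float(np.mean([(r - P[u] @ Q[i]) ** 2 for u, i, r in ratings]))

# Toy binary hotel-channel matrix (made-up values), flattened to triples.
R = np.array([
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 1],
], dtype=float)
ratings = [(u, i, R[u, i]) for u in range(R.shape[0]) for i in range(R.shape[1])]

P0, Q0 = fit_mf_sgd(ratings, *R.shape, n_epochs=0)  # untrained factors
P, Q = fit_mf_sgd(ratings, *R.shape)                # trained factors

print(f"MSE before training: {mse(ratings, P0, Q0):.4f}")
print(f"MSE after training:  {mse(ratings, P, Q):.4f}")
```

A prediction for hotel `u` and channel `i` is then simply `P[u] @ Q[i]`.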

In [None]:
# Step 1: Prepare the data matrix

# Reshapes the wide hotel x channel matrix above into long format,
# one (hotel, channel, rating) row per cell, including the zeros.
# The resulting DataFrame (ratings_df) has three columns:
# - hotel: The identifier of the hotel.
# - channel: The identifier of the distribution channel (Canal_ID).
# - rating: 1 if the hotel uses the channel, 0 otherwise.
# Reader(rating_scale=(0, 1)) tells Surprise that ratings range from 0 to 1.
# Dataset.load_from_df(ratings_df, reader) converts the DataFrame into a Surprise Dataset.


def prepare_data(df):
    # stack() walks the matrix row by row, which gives the same rows as a
    # nested loop over df.index and df.columns but without per-cell .loc lookups
    ratings_df = df.stack().reset_index()
    ratings_df.columns = ['hotel', 'channel', 'rating']
    reader = Reader(rating_scale=(0, 1))
    return Dataset.load_from_df(ratings_df, reader)


In [None]:
# Step 2: Train the model

# Splits the data into training (75%) and test (25%) sets.
# Only the training set is used here; the test set is held out but not evaluated in this notebook.
# SVD is a matrix factorization model used in collaborative filtering.
# Parameters:
# n_factors=100 → Number of latent factors (hidden features) in the model.
# n_epochs=20 → Number of passes over the training data.
# lr_all=0.005 → Learning rate for stochastic gradient descent.
# reg_all=0.02 → Regularization term to prevent overfitting.


def train_model(data):
    # testset is unused below; splitting keeps the held-out cells out of training
    trainset, testset = train_test_split(data, test_size=0.25, random_state=42)
    model = SVD(n_factors=100, n_epochs=20, lr_all=0.005, reg_all=0.02)
    model.fit(trainset)
    return model

In [None]:
# Step 3: Generate recommendations

# Processes the prediction results from the SVD model and extracts the top N recommendations for each hotel.
# Creates a Dictionary to Store Recommendations
# top_n is a dictionary where:
# -Keys = uid (hotel ID).
# -Values = A list of tuples (iid, est), where:
# -iid = channel ID.
# -est = predicted rating.
# Processes Predictions and Stores Estimated Ratings
# for uid, iid, true_r, est, _ in predictions:
#    top_n[uid].append((iid, est))
# The predictions list contains tuples with:
# -uid: Hotel ID
# -iid: Channel ID
# -true_r: Actual rating (ignored in this function)
# -est: Predicted rating (used for ranking)
# Stores (iid, est) in top_n[uid] for each hotel.
# Sorts Channels by Predicted Rating in Descending Order
#  for uid, user_ratings in top_n.items():
#     user_ratings.sort(key=lambda x: x[1], reverse=True)
#      top_n[uid] = user_ratings[:n]
# Sorts the channels for each hotel based on estimated rating (est).
# Keeps only the top N channels with the highest predicted ratings.
# Returns the Dictionary with Top N Recommendations

def get_top_n_recommendations(predictions, n=5):
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    
    return top_n
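
As a possible alternative to sorting every hotel's full list, the sketch below selects the top N with `heapq.nlargest`. The function name `top_n_with_heap` and the toy predictions are hypothetical; the plain tuples stand in for Surprise's prediction objects and follow the same field order `(uid, iid, true_r, est, details)` that the function above unpacks.

```python
import heapq
from collections import defaultdict

def top_n_with_heap(predictions, n=5):
    """Group (channel, score) pairs by hotel and keep the n highest scores."""
    by_hotel = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        by_hotel[uid].append((iid, est))
    # nlargest returns the n best items sorted from highest to lowest score
    return {uid: heapq.nlargest(n, pairs, key=lambda x: x[1])
            for uid, pairs in by_hotel.items()}

# Toy predictions: (hotel, channel, true rating, estimated rating, details)
toy_predictions = [
    (1, 'A', 0, 0.9, {}),
    (1, 'B', 0, 0.2, {}),
    (1, 'C', 0, 0.7, {}),
    (2, 'A', 0, 0.1, {}),
    (2, 'B', 0, 0.8, {}),
]

top = top_n_with_heap(toy_predictions, n=2)
print(top)
# {1: [('A', 0.9), ('C', 0.7)], 2: [('B', 0.8), ('A', 0.1)]}
```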

In [None]:
# Step 4: Recommend channels

# Generates top N recommended channels for a given hotel using a trained SVD model.
# Extracts All Available Channels from the Dataset
# Creates a Test Set for the Given Hotel
# Generates a list of test samples where:
# -hotel_id: The hotel we want recommendations for.
# -iid: Each possible channel.
# -0: A placeholder rating (it will be predicted).
# Uses the SVD model to predict ratings for all channels.
# Returns the Top N Channels for the Hotel

def recommend_channels(hotel_id, model, data, n=5):
    iids = data.df['channel'].unique()
    testset = [(hotel_id, iid, 0) for iid in iids]
    predictions = model.test(testset)
    top_n = get_top_n_recommendations(predictions, n)
    return top_n[hotel_id]

In [None]:
# Main execution

df = pivot_table
data = prepare_data(df)
model = train_model(data)


In [None]:
# Example usage for 1 hotel
hotel_id = df.index[0]  # Choose a hotel

recommendations = recommend_channels(hotel_id, model, data, n=50)

print(f"Recommended channels for hotel {hotel_id:.0f}:")

for channel, score in recommendations:
    print(f"{channel}: {score:.4f}")

In [None]:

def recommend_channels_exclude_existing(hotel_id, model, data, existing_channels, n=50):
    # Channels the hotel already uses, as a set for fast membership tests
    existing_hotel_channels = set(
        existing_channels.loc[existing_channels['Hotel_ID'] == hotel_id, 'Canal_ID']
    )

    # Candidate channels: every channel in the data that the hotel does not use yet
    iids = [iid for iid in data.df['channel'].unique() if iid not in existing_hotel_channels]

    # Generate test set for the given hotel (the 0 is a placeholder rating)
    testset = [(hotel_id, iid, 0) for iid in iids]

    # Get predictions for the candidate channels
    predictions = model.test(testset)

    # Rank the candidates and keep the top N. Filtering before ranking means the
    # hotel still gets up to N recommendations instead of N minus its existing channels.
    top_n = get_top_n_recommendations(predictions, n)

    return top_n[hotel_id]

In [None]:
# Create a dictionary to store the recommendations for each hotel
recommendations_dict = {}

# Loop through each hotel in df and get the top 50 recommended channels
for hotel_id in df.index:
    recommendations = recommend_channels_exclude_existing(hotel_id, model, data, hotel_city_chanel_combin_extract, n=50)
    
    # Store the recommendations in the dictionary
    recommendations_dict[hotel_id] = recommendations

In [None]:
import pickle

# Save the recommendations_dict using pickle
# Pickle serializes (converts) Python objects into a binary format for storage or transfer 
# Then deserializes (restores) them back to their original form when needed. 
# Serialization (Pickling): The process of converting a Python object into a byte stream (binary data) that can be saved to a file or sent over a network.
# Deserialization (Unpickling): The process of reading a byte stream (binary data) and converting it back into a Python object.
# Pickle uses a binary format to represent Python objects (not human-readable).

with open('../out/recommendations_dict.pkl', 'wb') as f:
    pickle.dump(recommendations_dict, f)
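
The round trip described in the comments above can be illustrated in memory, without touching the file system. The small dictionary below is made-up data shaped like `recommendations_dict` (hotel ID mapped to a list of `(channel, score)` tuples).

```python
import pickle

# Made-up data in the same shape as recommendations_dict
toy_recommendations = {101: [(7, 0.93), (3, 0.81)], 102: [(5, 0.77)]}

blob = pickle.dumps(toy_recommendations)   # serialize to a bytes object
restored = pickle.loads(blob)              # deserialize back to a dict

print(type(blob).__name__, restored == toy_recommendations)
# bytes True
```

Note that unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.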

In [None]:
# Load the recommendations_dict using pickle (same path as the file saved above)
with open('../out/recommendations_dict.pkl', 'rb') as f:
    loaded_recommendations_dict = pickle.load(f)

In [None]:
flattened_data = []

for hotel_id, recommendations in recommendations_dict.items():
    for channel_ID, score in recommendations:
        flattened_data.append({
            'Hotel_ID': hotel_id,
            'Channel_ID': channel_ID,
            'Score': score  
        })


flattened_data = pd.DataFrame(flattened_data)

In [None]:
pd.set_option('display.float_format', lambda x: '%.4f' % x)

flattened_data

In [None]:
flattened_data['Hotel_ID'] = flattened_data['Hotel_ID'].astype(int)

In [None]:
flattened_data.to_csv('../out/svd_hotel_channel_recommendations_df_top50.csv', index=False)

## Check Recommendations based on hotel similarity

In [None]:
import pandas as pd 

# List of hotel-channel combinations as of January 2025
hotel_city_chanel_combin_extract  = pd.read_csv('../data/other/hotel_city_chanel_combin_extract.csv')
hotel_city_chanel_combin_extract.dropna(inplace=True)
hotel_city_chanel_combin_extract.drop(columns=['Cidade_ID'], inplace=True)
hotel_city_chanel_combin_extract.drop_duplicates(inplace=True)

# Pivot the table
pivot_table = hotel_city_chanel_combin_extract.pivot_table(index='Hotel_ID', columns='Canal_ID', aggfunc='size', fill_value=0)
# Convert the table to binary (1 where the combination existed, 0 otherwise)
pivot_table = pivot_table.map(lambda x: 1 if x > 0 else 0)


In [None]:
pivot_table

In [None]:
flattened_data = pd.read_csv('../out/svd_hotel_channel_recommendations_df_top50.csv')

In [None]:
flattened_data

In [None]:
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

def compute_hotel_similarity(pivot_table):
    """
    Computes cosine similarity between all hotels in the pivot table.
    
    Parameters:
    - pivot_table: DataFrame with Hotel_ID as index, Canal_IDs as binary columns
    
    Returns:
    - similarity_df: DataFrame with cosine similarity values between hotels
    """

    similarity_matrix = cosine_similarity(pivot_table)
    similarity_df = pd.DataFrame(similarity_matrix, 
                                 index=pivot_table.index, 
                                 columns=pivot_table.index)
    return similarity_df
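
To show what the similarity values mean for binary data, the sketch below computes cosine similarity by hand with NumPy. For 0/1 vectors it reduces to the number of shared channels divided by the square root of the product of the two channel counts. The function name `cosine_sim_matrix` and the toy matrix are assumptions for the example; the notebook itself relies on `sklearn.metrics.pairwise.cosine_similarity`.

```python
import numpy as np

def cosine_sim_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid division by zero; all-zero rows get similarity 0
    Xn = X / norms            # scale every row to unit length
    return Xn @ Xn.T          # dot products of unit vectors are cosines

# Toy data: 3 hotels x 4 channels (made-up values)
X = np.array([
    [1, 1, 0, 0],   # hotel 0 uses channels 0 and 1
    [1, 1, 1, 0],   # hotel 1 uses channels 0, 1 and 2
    [0, 0, 1, 1],   # hotel 2 uses channels 2 and 3
])

S = cosine_sim_matrix(X)

# Hotels 0 and 1 share 2 channels and use 2 and 3 channels respectively,
# so their similarity is 2 / sqrt(2 * 3).
shared_formula = 2 / np.sqrt(2 * 3)
print(f"S[0, 1] = {S[0, 1]:.4f}, formula = {shared_formula:.4f}")
print(f"S[0, 2] = {S[0, 2]:.4f}  (no shared channels)")
```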

In [None]:
def get_closest_hotels(hotel_id, similarity_df, top_n=5, direction="top"):
    """
    Returns the top_n most similar or least similar hotels to the given hotel_id.
    
    Parameters:
    - hotel_id: The ID of the hotel for which we want recommendations
    - similarity_df: DataFrame with cosine similarity values between hotels
    - top_n: Number of similar hotels to return (top or bottom)
    - direction: "top" for most similar, "bottom" for least similar
    
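    Returns:
    - A tuple (hotel_ids, similarity_values): two lists of equal length holding
      the IDs of the selected hotels and their similarity scores
    """
    # Get the similarity scores for the given hotel_id, excluding the hotel itself.
    # Dropping by label is safer than skipping the first sorted entry, because
    # another hotel with an identical channel set also has similarity 1.0.
    similarity_scores = similarity_df[hotel_id].drop(hotel_id).sort_values(ascending=False)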
    
    if direction == "top":
        # Get the top N most similar hotels (highest similarity)
        closest_hotels = similarity_scores.head(top_n)
    elif direction == "bottom":
        # Get the bottom N least similar hotels (lowest similarity)
        closest_hotels = similarity_scores.tail(top_n)
    else:
        raise ValueError("Direction must be 'top' or 'bottom'")
    
    return closest_hotels.index.tolist(), closest_hotels.values.tolist()


In [None]:
similarity_df = compute_hotel_similarity(pivot_table)

# Example: Get the 5 most similar hotels for one hotel
hotel_id = 7941  # Example hotel ID
top_n = 5
similar_hotels, similarity_values = get_closest_hotels(hotel_id, similarity_df, top_n, 'top')
print("Most similar hotels to hotel", hotel_id, ":", similar_hotels)


In [None]:
def get_channel_recommendations(hotel_id, flattened_data):
    """
    Returns the set of recommended channels for a given hotel.
    
    Parameters:
    - hotel_id: The ID of the hotel
    - flattened_data: DataFrame with Hotel_ID, Channel_ID, and Score columns
    
    Returns:
    - recommended_channels: Set of Channel_IDs recommended for the hotel
    """
    # Filter recommendations for the given hotel
    hotel_data = flattened_data[flattened_data['Hotel_ID'] == hotel_id]
    recommended_channels = set(hotel_data['Channel_ID'])
    return recommended_channels

In [None]:
# Example: Get channel recommendations for the input hotel and closest hotels
input_hotel = hotel_id  # Example hotel ID
closest_hotels = similar_hotels  # Example closest hotels

input_hotel_channels = get_channel_recommendations(input_hotel, flattened_data)

closest_hotels_channels = {
    hotel: get_channel_recommendations(hotel, flattened_data) for hotel in closest_hotels
}

# Print the channel recommendations for the input hotel and its closest hotels
print("Input Hotel Channels:", input_hotel_channels)
for hotel, channels in closest_hotels_channels.items():
    print(f"Channels for Hotel {hotel}: {channels}")

In [None]:
def find_channel_intersection(input_hotel_channels, closest_hotels_channels):
    """
    Finds the intersection of recommended channels between the input hotel and each of its closest hotels.
    
    Parameters:
    - input_hotel_channels: Set of recommended channels for the input hotel
    - closest_hotels_channels: Dictionary with hotel IDs as keys and sets of recommended channels as values
    
    Returns:
    - intersections: Dictionary with hotel IDs as keys and the intersection of recommended channels as values
    """
    intersections = {}
    
    for hotel_id, channels in closest_hotels_channels.items():
        intersection = input_hotel_channels.intersection(channels)
        intersections[hotel_id] = intersection
    
    return intersections

# Get intersections of channel recommendations
channel_intersections = find_channel_intersection(input_hotel_channels, closest_hotels_channels)

# Print the intersections
print("\nChannel Recommendations Intersections:")
for hotel, intersection in channel_intersections.items():
    print(f"Intersection with Hotel {hotel}: {intersection}")


In [None]:
input_hotel_channels

In [None]:
# Print the length of the intersection for each hotel in the dictionary
print("Number of common channel recommendations:")

for hotel_id, intersection in channel_intersections.items():
    print(f"Hotel {hotel_id}: {len(intersection)} common channels")