# Hybrid Recommendation System

## Introduction

In the field of recommendation systems, combining multiple recommendation techniques can yield more accurate and personalized results. This document introduces a hybrid recommendation system that leverages both association rule-based and content-based recommendation methods. The goal is to provide more effective product recommendations by integrating these two approaches.

## Use Case

The hybrid recommendation system is designed to suggest products to users based on their interests. The system takes an input product and generates recommendations by combining insights from two different methods:

1. **Association Rule-Based Recommendations**: These recommendations are derived from association rules that capture relationships between products based on historical transaction data. For example, if customers who buy product A also frequently buy product B, then product B will be recommended when product A is purchased.

2. **Content-Based Recommendations**: These recommendations are based on the features of the products. The system compares the features of the input product with those of other products and suggests items that are similar in terms of content.

## Theory Behind the Hybrid Recommendation System

### 1. Association Rule-Based Recommendations

Association rule mining involves discovering interesting relationships between variables in large datasets. In the context of recommendations, it identifies products that are frequently bought together. The core components of association rules are:

- **Antecedent**: The item(s) on the "if" side of a rule, that is, the product(s) for which we want to find recommendations.
- **Consequent**: The item(s) on the "then" side of a rule, that is, the product(s) recommended given the antecedent.
- **Confidence**: The fraction of transactions containing the antecedent that also contain the consequent.
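
As a minimal sketch of these definitions, the snippet below computes support, confidence, and lift on a handful of made-up baskets. The data and the helper names (`support`, `confidence`, `lift`) are illustrative assumptions and are not part of the notebook's pipeline, which relies on `mlxtend` instead.

```python
# Toy baskets (hypothetical data, not taken from the Instacart dataset)
baskets = [
    {"pear", "strawberries"},
    {"pear", "strawberries", "kiwi"},
    {"pear", "kiwi"},
    {"strawberries"},
    {"kiwi", "strawberries"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's baseline support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

# Rule: pear -> strawberries
print(support({"pear"}, baskets))                       # 3 of 5 baskets
print(confidence({"pear"}, {"strawberries"}, baskets))  # 2 of the 3 pear baskets
print(lift({"pear"}, {"strawberries"}, baskets))        # below 1: weaker than baseline
```

The same relationships can be checked against the first row of the `rules.head()` output later in this notebook: support divided by antecedent support (0.10297 / 0.131269) gives roughly the reported confidence of 0.784.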

### 2. Content-Based Recommendations

Content-based filtering suggests items based on their attributes and the preferences of the user. This method relies on:

- **Feature Extraction**: Identifying relevant features of the products.
- **Similarity Measurement**: Calculating how similar the features of different products are to the input product.
- **Scoring**: Ranking products based on their similarity scores.
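
The three steps can be sketched in plain Python as follows. This is a simplified illustration that uses raw token counts rather than TF-IDF weights, and the product descriptions and function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical product descriptions (name plus aisle), for illustration only
toy_products = {
    "A": "organic bartlett pear fresh fruits",
    "B": "bartlett pear fresh fruits",
    "C": "organic whole milk dairy",
}

def featurize(text):
    """Feature extraction: count how often each token appears."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Similarity measurement: cosine of the angle between two count vectors."""
    dot = sum(u[token] * v[token] for token in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_similar(input_id, products):
    """Scoring: rank every other product by similarity to the input product."""
    vectors = {pid: featurize(text) for pid, text in products.items()}
    scores = {
        pid: cosine(vectors[input_id], vec)
        for pid, vec in vectors.items()
        if pid != input_id
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_similar("A", toy_products)
print(ranking)  # "B" ranks above "C" because it shares more tokens with "A"
```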

### Combining the Two Approaches

The hybrid recommendation system integrates the association rule-based and content-based recommendations to leverage the strengths of both approaches. Here's how it works:

1. **Generate Recommendations**: Obtain recommendations from both the association rules and content-based methods.
2. **Normalize Scores**: Rescale the scores from both methods to a common [0, 1] range (min-max normalization) so that they are comparable.
3. **Combine Scores**: Create a final score by combining normalized scores from both methods, with equal weights assigned to each.
4. **Rank and Return**: Sort the products based on the final score and return the top recommendations.
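
A minimal sketch of steps 2 to 4, assuming min-max scaling and treating a product that is missing from one method as having a raw score of 0 for that method. The function names, weights as parameters, and the example scores are illustrative assumptions.

```python
def min_max(scores):
    """Rescale a dict of scores to [0, 1]; if all scores are equal, map them to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 0.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

def combine(assoc, content, w_assoc=0.5, w_content=0.5, top_n=5):
    """Blend two score dicts (product ID -> score) into one ranked list."""
    ids = set(assoc) | set(content)
    # A product recommended by only one method gets a raw score of 0 from the other
    norm_assoc = min_max({pid: assoc.get(pid, 0.0) for pid in ids})
    norm_content = min_max({pid: content.get(pid, 0.0) for pid in ids})
    final = {
        pid: w_assoc * norm_assoc[pid] + w_content * norm_content[pid]
        for pid in ids
    }
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

# Hypothetical scores: p2 is suggested by both methods
assoc_scores = {"p1": 0.8, "p2": 0.4}
content_scores = {"p2": 0.9, "p3": 0.3}
ranked = combine(assoc_scores, content_scores)
print(ranked)  # p2 ranks first because both methods support it
```

With equal weights, a product that tops only one list can score at most 0.5, so products supported by both methods tend to rise to the top.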

In [1]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

This notebook uses the Instacart Market Basket Analysis dataset, which can be downloaded from Kaggle: https://www.kaggle.com/datasets/psparks/instacart-market-basket-analysis

In [2]:
# Preprocess downloaded data
order_prod = pd.read_csv("order_products__prior.csv")
orders = pd.read_csv("orders.csv")
df = pd.merge(orders, order_prod, how="inner", on="order_id")[["order_id", "user_id", "product_id"]]
products = pd.read_csv("products.csv")
aisles = pd.read_csv("aisles.csv")
products = pd.merge(products, aisles, how="inner", on="aisle_id")
department = pd.read_csv("departments.csv")
products = pd.merge(products, department, how="inner", on="department_id")

The functions below take as input an order ID (or something similar), a user ID, a product ID, a product name, and the products' characteristics.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32434489 entries, 0 to 32434488
Data columns (total 3 columns):
 #   Column      Dtype
---  ------      -----
 0   order_id    int64
 1   user_id     int64
 2   product_id  int64
dtypes: int64(3)
memory usage: 742.4 MB


In [4]:
products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49688 entries, 0 to 49687
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_id     49688 non-null  int64 
 1   product_name   49688 non-null  object
 2   aisle_id       49688 non-null  int64 
 3   department_id  49688 non-null  int64 
 4   aisle          49688 non-null  object
 5   department     49688 non-null  object
dtypes: int64(3), object(3)
memory usage: 2.3+ MB


### Association Rule-Based Recommendations

In [5]:
# Import libraries
from mlxtend.frequent_patterns import fpgrowth, association_rules
import statsmodels.stats.api as sms

We separate association rule generation from the recommendation function. The function `generate_association_rules` mines the rules once, so that later functions can reuse them without regenerating them unnecessarily.

In [6]:
def generate_association_rules(input_df,
                               product_df,
                               user_id_col,
                               prod_id_col,
                               prod_name_col,
                               basket_id_col,
                               min_support='default',
                               metric="confidence",
                               min_threshold=0.25):
    
    """
    Generate association rules from transaction data.

    Parameters:
    - input_df: DataFrame containing transaction data (or equivalent) with columns for user IDs, product IDs, and order IDs.
    - product_df: DataFrame containing product information with columns for product IDs and product names.
    - user_id_col: Column name in input_df representing user IDs.
    - prod_id_col: Column name in input_df representing product IDs.
    - prod_name_col: Column name in product_df representing product names.
    - basket_id_col: Column name in input_df representing basket (order) IDs.
    - min_support: Minimum support threshold for frequent itemsets, a number between 0 and 1. If set to 'default', 4 / (number of rows in the basket matrix) is used.
    - metric: Metric to use for evaluating the rules ('confidence' or 'lift').
    - min_threshold: Minimum threshold for the metric specified.

    Returns: DataFrame containing the association rules with antecedents and consequents translated into product names.
    """


    # Create a copy of the input dataframe
    input_df_used = input_df.copy()
    product_df_used = product_df.copy()

    # Keep products whose purchase count exceeds the lower bound of the
    # 95% confidence interval for the mean purchase count per product
    product_counts = input_df_used[prod_id_col].value_counts()
    low_conf, _ = sms.DescrStatsW(product_counts).tconfint_mean()
    important_products = product_counts[product_counts > low_conf].index
    input_df_used = input_df_used[input_df_used[prod_id_col].isin(important_products)]

    # Apply the same confidence interval filter to users, based on rows per user
    user_counts = input_df_used[user_id_col].value_counts()
    low_conf, _ = sms.DescrStatsW(user_counts).tconfint_mean()
    important_users = user_counts[user_counts > low_conf].index
    input_df_used = input_df_used[input_df_used[user_id_col].isin(important_users)]

    # Create a boolean user-by-product matrix (True if the user bought the product at least once)
    basket = (input_df_used
              .groupby([user_id_col, prod_id_col])[basket_id_col]
              .count().unstack().notnull())

    # Apply FP-Growth algorithm (it is faster than apriori for large dataframe)
    if min_support == 'default':
        frequent_itemsets = fpgrowth(basket, min_support=4/len(basket), use_colnames=True)
    else:
        frequent_itemsets = fpgrowth(basket, min_support=min_support, use_colnames=True)

    # Generate association rules
    rules = association_rules(frequent_itemsets, metric=metric, min_threshold=min_threshold)

    # Build a product ID -> product name lookup (first name wins if an ID is repeated)
    id_to_name = (product_df_used
                  .drop_duplicates(subset=prod_id_col)
                  .set_index(prod_id_col)[prod_name_col]
                  .to_dict())

    # Function to map product IDs to names
    def create_name_column(product_ids):
        return [id_to_name[pid] for pid in product_ids]

    # Add product names to rules, then convert frozensets to lists for ease of manipulation
    rules['antecedents_names'] = rules['antecedents'].apply(create_name_column)
    rules['recommend_product_name'] = rules['consequents'].apply(create_name_column)
    rules['antecedents'] = rules['antecedents'].apply(list)
    rules['consequents'] = rules['consequents'].apply(list)

    # Filter rules to only include those with exactly 1 item in both antecedents and consequents
    single_item = (rules['antecedents'].apply(len) == 1) & (rules['consequents'].apply(len) == 1)

    # Sort rules by confidence (returns a new DataFrame instead of modifying a slice in place)
    rules = rules[single_item].sort_values(by='confidence', ascending=False)

    return rules

In [7]:
# Example usage
rules = generate_association_rules(input_df=df,
                                   product_df=products, 
                                   user_id_col="user_id",
                                   prod_id_col="product_id",
                                   prod_name_col="product_name",
                                   basket_id_col="order_id",
                                   min_support=0.1,
                                   metric="confidence",
                                   min_threshold=0.25)

In [8]:
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedents_names | recommend_product_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3412 | [43122] | [21137] | 0.131269 | 0.531834 | 0.10297 | 0.78442 | 1.474932 | 0.033157 | 2.171654 | 0.370659 | [Organic Bartlett Pear] | [Organic Strawberries] |
| 525 | [39928] | [21137] | 0.159222 | 0.531834 | 0.121969 | 0.76603 | 1.440355 | 0.037289 | 2.000967 | 0.363623 | [Organic Kiwi] | [Organic Strawberries] |
| 2221 | [22035] | [21137] | 0.1579 | 0.531834 | 0.120779 | 0.764909 | 1.438247 | 0.036802 | 1.991423 | 0.361844 | [Organic Whole String Cheese] | [Organic Strawberries] |
| 522 | [22825] | [21137] | 0.172239 | 0.531834 | 0.130922 | 0.760119 | 1.42924 | 0.03932 | 1.951657 | 0.362819 | [Organic D'Anjou Pears] | [Organic Strawberries] |
| 2523 | [46906] | [21137] | 0.137811 | 0.531834 | 0.102095 | 0.74083 | 1.39297 | 0.028802 | 1.8064 | 0.327202 | [Grape White/Green Seedless] | [Organic Strawberries] |


Next, we create a function that recommends products based on the association rules generated above.

In [9]:
def recommend_products_association_rules(input_product_id, rules, top_n=5):

    """
    Recommend products based on association rules.

    Parameters:
    - input_product_id: ID of the input product for which recommendations are to be generated.
    - rules: DataFrame containing association rules with columns for antecedents, consequents, and confidence.
    - top_n: Number of top recommendations to return.

    Returns: DataFrame containing recommended products with their IDs, names, and the confidence score from the association rules.
    """
        
    # Convert input_product_id to string
    input_product_id = str(input_product_id)
    
    # Convert antecedents to string and filter rules where the antecedent includes the input product
    filtered_rules = rules[rules['antecedents'].apply(lambda x: input_product_id in [str(item) for item in x])]

    # If the input product is not in the antecedent list, return empty dataframe
    if filtered_rules.empty:
        return pd.DataFrame(columns=['recommend_product_id', 'recommend_product_name', 'confidence'])

    # Convert consequents to string and extract recommended products
    recommendations = filtered_rules[['consequents', 'recommend_product_name', 'confidence']].explode(['consequents', 'recommend_product_name'])
    recommendations['consequents'] = recommendations['consequents'].astype(str)

    # Group by 'consequents' and 'recommend_product_name' to get the maximum confidence for each recommendation
    recommendations = recommendations.groupby(['consequents', 'recommend_product_name'])['confidence'].max().reset_index()

    # Rename columns for clarity
    recommendations.rename(columns={"consequents": "recommend_product_id", "confidence":"confidence_association_rule"}, inplace=True)

    # Sort by confidence in descending order and return the top N recommendations
    return recommendations.sort_values(by='confidence_association_rule', ascending=False).head(top_n)

In [10]:
# Example usage
recommend_products_association_rules(input_product_id="43122", 
                                     rules=rules)

| | recommend_product_id | recommend_product_name | confidence_association_rule |
|---|---|---|---|
| 0 | 21137 | Organic Strawberries | 0.78442 |


### Content-Based Recommendations

In [11]:
def recommend_products_content_based(product_df, 
                                     input_product_id,
                                     product_id_col,
                                     product_name_col,
                                     based_features,
                                     top_n=5):
    
    """
    Recommend products based on content similarity using a TF-IDF vectorization approach.

    Parameters:
    - product_df: DataFrame containing product information with columns for product IDs and feature columns.
    - input_product_id: ID of the product for which recommendations are to be generated.
    - product_id_col: Column name in product_df representing product IDs.
    - product_name_col: Column name in product_df representing product names.
    - based_features: List of column names in product_df representing the features to be used for similarity calculations.
    - top_n: Number of top recommendations to return.

    Returns: DataFrame containing recommended products with their IDs, names, and similarity scores.
    """

    # Create a copy of product dataframe
    product_df_used = product_df.copy()
    
    # Combine relevant features into a single string
    product_df_used['Combined_Features'] = product_df_used[based_features].apply(lambda x: '_'.join(x.astype(str)), axis=1)

    # Vectorize the combined features using TF-IDF
    tfidf = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf.fit_transform(product_df_used['Combined_Features'])

    # Convert both product id column and input product id to string for consistent mapping
    input_product_id = str(input_product_id)
    product_df_used[product_id_col] = product_df_used[product_id_col].astype(str)

    # If the input product ID is not in the DataFrame, return an empty DataFrame
    if input_product_id not in product_df_used[product_id_col].values:
        return pd.DataFrame(columns=['recommend_product_id', 'recommend_product_name', 'score_content_based'])

    # Find the row position of the input product (positional, so it works with any index)
    product_index = int((product_df_used[product_id_col] == input_product_id).to_numpy().argmax())

    # Compute cosine similarity between the input product and all products only;
    # the full pairwise matrix would hold len(product_df)^2 entries
    sim_row = cosine_similarity(tfidf_matrix[product_index:product_index + 1], tfidf_matrix).ravel()

    # Pair each row position with its score, excluding the input product itself
    sim_scores = [(i, score) for i, score in enumerate(sim_row) if i != product_index]

    # Sort by similarity and keep the top_n most similar products
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
    sim_indices = [i[0] for i in sim_scores]
    sim_values = [i[1] for i in sim_scores]

    # Create a DataFrame with product IDs, names, and similarity scores
    recommendations = pd.DataFrame({
        'recommend_product_id': [product_df_used[product_id_col].iloc[i] for i in sim_indices],
        'recommend_product_name': [product_df_used[product_name_col].iloc[i] for i in sim_indices],
        'score_content_based': sim_values
    })
    
    # Return the DataFrame with the top N recommendations
    return recommendations

In [12]:
# Example usage
recommend_products_content_based(product_df=products, 
                                 input_product_id="43122",
                                 product_id_col="product_id",
                                 product_name_col="product_name",
                                 based_features=["product_name", "aisle", "department"])

| | recommend_product_id | recommend_product_name | score_content_based |
|---|---|---|---|
| 0 | 19881 | Bartlett Pear | 0.972623 |
| 1 | 27925 | Organic Red Bartlett Pear | 0.934368 |
| 2 | 5398 | Organic Red Pear | 0.680784 |
| 3 | 19106 | Red Pear | 0.641723 |
| 4 | 40235 | Organic Bartlett Pears | 0.590801 |

### Hybrid Recommendations


In [13]:
def hybrid_recommendation_system(product_df,
                                 rules_df, 
                                 input_product_id,
                                 product_id_col,
                                 product_name_col,
                                 based_features,
                                 top_n=5):
    
    """
    Generate hybrid product recommendations based on both association rules and content-based similarity.

    Parameters:
    - input_product_id: ID of the input product for which recommendations are to be generated.
    - product_name_col: Column name in product_df representing product names.
    - product_id_col: Column name in product_df representing product IDs.
    - rules_df: DataFrame containing association rules with columns for antecedents, consequents, and confidence.
    - product_df: DataFrame containing product information with feature columns used for content-based recommendations.
    - based_features: List of column names in product_df representing the features to be used for content-based similarity calculations.
    - top_n: Number of top recommendations to return.

    Returns: DataFrame containing recommended products with their IDs, names, and the final score based on a hybrid of association rules and content-based methods.
    """

    # Create a copy of the product DataFrame and convert product IDs to string
    product_df_used = product_df.copy()
    input_product_id = str(input_product_id)
    product_df_used[product_id_col] = product_df_used[product_id_col].astype(str)
    
    # Check if the input product ID exists in the DataFrame
    if input_product_id not in product_df_used[product_id_col].values:
        # Return an empty DataFrame with the same columns as a normal result
        return pd.DataFrame(columns=['recommend_product_name', 'recommend_product_id', 'final_score'])

    # Get recommendations based on association rules
    association_recommendations = recommend_products_association_rules(input_product_id, rules_df, top_n=top_n)
    
    # Get content-based recommendations
    content_recommendations = recommend_products_content_based(
        product_df_used, input_product_id, product_id_col, product_name_col, based_features, top_n=top_n
    )
    
    # Ensure the recommendations DataFrames have the necessary columns
    if 'recommend_product_id' not in association_recommendations.columns:
        association_recommendations = pd.DataFrame(columns=['recommend_product_id', 'recommend_product_name', 'confidence_association_rule'])
    if 'recommend_product_id' not in content_recommendations.columns:
        content_recommendations = pd.DataFrame(columns=['recommend_product_id', 'recommend_product_name', 'score_content_based'])
    
    # If no association recommendations are found, return content recommendations only
    if association_recommendations.empty:
        content_recommendations['final_score'] = content_recommendations['score_content_based']
        return content_recommendations[['recommend_product_name', 'recommend_product_id', 'final_score']].head(top_n)

    # Merge the association and content-based recommendations
    hybrid_recommendations = pd.merge(
        association_recommendations,
        content_recommendations,
        left_on=['recommend_product_id', 'recommend_product_name'],
        right_on=['recommend_product_id', 'recommend_product_name'],
        how='outer'
    )
    
    # Replace NaN values with 0 for missing scores
    hybrid_recommendations['confidence_association_rule'] = hybrid_recommendations['confidence_association_rule'].fillna(0)
    hybrid_recommendations['score_content_based'] = hybrid_recommendations['score_content_based'].fillna(0)
    
    # Normalize scores if there is more than one recommendation
    if len(hybrid_recommendations) > 1:
        # Normalize confidence scores
        if hybrid_recommendations['confidence_association_rule'].max() != hybrid_recommendations['confidence_association_rule'].min():
            hybrid_recommendations['normalized_confidence'] = (
                hybrid_recommendations['confidence_association_rule'] - hybrid_recommendations['confidence_association_rule'].min()
            ) / (hybrid_recommendations['confidence_association_rule'].max() - hybrid_recommendations['confidence_association_rule'].min())
        else:
            hybrid_recommendations['normalized_confidence'] = 0

        # Normalize similarity scores
        if hybrid_recommendations['score_content_based'].max() != hybrid_recommendations['score_content_based'].min():
            hybrid_recommendations['normalized_similarity_score'] = (
                hybrid_recommendations['score_content_based'] - hybrid_recommendations['score_content_based'].min()
            ) / (hybrid_recommendations['score_content_based'].max() - hybrid_recommendations['score_content_based'].min())
        else:
            hybrid_recommendations['normalized_similarity_score'] = 0
    else:
        # If only one recommendation, set normalized values directly
        hybrid_recommendations['normalized_confidence'] = hybrid_recommendations['confidence_association_rule']
        hybrid_recommendations['normalized_similarity_score'] = hybrid_recommendations['score_content_based']

    # Combine normalized scores into a final score with equal weights
    hybrid_recommendations['final_score'] = (
        hybrid_recommendations['normalized_confidence'] * 0.5 +
        hybrid_recommendations['normalized_similarity_score'] * 0.5
    )

    # Sort by the final score
    hybrid_recommendations = hybrid_recommendations.sort_values(by='final_score', ascending=False)
    
    # Return the top N recommendations
    return hybrid_recommendations[['recommend_product_name', 'recommend_product_id', 'final_score']].head(top_n)

In [14]:
hybrid_recommendation_system(product_df=products, 
                             rules_df=rules,
                             input_product_id="43122",
                             product_id_col="product_id",
                             product_name_col="product_name",
                             based_features=["product_name", "aisle", "department"])

| | recommend_product_name | recommend_product_id | final_score |
|---|---|---|---|
| 1 | Bartlett Pear | 19881 | 0.5 |
| 2 | Organic Strawberries | 21137 | 0.5 |
| 3 | Organic Red Bartlett Pear | 27925 | 0.480334 |
| 5 | Organic Red Pear | 5398 | 0.349974 |
| 0 | Red Pear | 19106 | 0.329893 |
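
As a sanity check, the final scores can be recomputed by hand from the two earlier outputs. The sketch below copies the displayed (rounded) scores, so it should agree with the hybrid output only to within rounding; the dictionary names are illustrative.

```python
# Scores copied from the earlier outputs for product 43122 (rounded to six decimals)
confidence_scores = {"21137": 0.78442}
content_scores = {
    "19881": 0.972623,
    "27925": 0.934368,
    "5398": 0.680784,
    "19106": 0.641723,
    "40235": 0.590801,
}

product_ids = list(confidence_scores) + list(content_scores)

# After the outer merge, a product missing from one method scores 0 there,
# so each minimum is 0 and min-max scaling reduces to dividing by the maximum.
max_conf = max(confidence_scores.values())
max_sim = max(content_scores.values())

final_scores = {
    pid: 0.5 * confidence_scores.get(pid, 0.0) / max_conf
         + 0.5 * content_scores.get(pid, 0.0) / max_sim
    for pid in product_ids
}

for pid, score in sorted(final_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pid, round(score, 6))
```

Organic Strawberries and Bartlett Pear both reach 0.5 because each is the top item of exactly one method, while Organic Bartlett Pears (40235) has the lowest combined score of the six candidates and falls outside the top five.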
