### Collaborative filtering

Collaborative filtering is one of the most widely used core algorithms in recommendation systems. It produces personalized recommendations by predicting a user's interests from preference information (such as likes and behavioral data) collected from many users. The method rests on the assumption that "if people with similar tastes to mine like something, I am also likely to like it."
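The "similar tastes" assumption can be illustrated with a minimal user-based sketch. The rating values, the helper names, and the choice of cosine similarity over raw rating rows (with 0 standing for "not rated") are all assumptions made for illustration, not part of the original analysis.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 5, 1],
    [1, 1, 2, 5],
], dtype=float)

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict_rating(ratings, user, item):
    """Similarity-weighted average of the other users' ratings for `item`."""
    weighted_sum, weight_total = 0.0, 0.0
    for other in range(ratings.shape[0]):
        if other == user or ratings[other, item] == 0:
            continue  # skip the target user and users who have not rated the item
        sim = cosine_similarity(ratings[user], ratings[other])
        weighted_sum += sim * ratings[other, item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else 0.0

# User 0 has not rated item 2; user 1 (similar taste) rated it 5, user 2 rated it 2.
predicted = predict_rating(ratings, user=0, item=2)
print(round(predicted, 2))
```

Because user 0's ratings resemble user 1's far more than user 2's, the prediction lands closer to user 1's rating.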

### SVD

Singular Value Decomposition (SVD) is a matrix decomposition technique widely used in collaborative filtering.

In collaborative filtering, the user-item rating matrix is usually large and sparse (mostly empty). SVD factorizes it into three matrices: user factors, singular values, and item factors. Keeping only the largest singular values yields a low-rank approximation in which users and items are described by a small number of hidden (latent) factors.

After a rating matrix is decomposed with SVD, users and movies become coordinates in a latent feature space. The axes are learned from the data rather than labeled, but they can often be interpreted as contrasts such as 'for children/adults' or 'blockbuster/art films'.
For example, User A's coordinates might lean toward the 'adult-oriented' and 'blockbuster' ends, while User B's lean toward 'child-friendly' and 'art film'.
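The decomposition described above can be sketched with NumPy on a tiny dense matrix. The ratings are made up, and the rank `k = 2` is an arbitrary choice for illustration; a real rating matrix would be sparse and would need a strategy for missing entries.

```python
import numpy as np

# Toy rating matrix: 4 users x 4 movies (values are made up for illustration).
R = np.array([
    [5, 5, 1, 1],
    [4, 5, 1, 2],
    [1, 1, 5, 4],
    [2, 1, 4, 5],
], dtype=float)

# Full decomposition: R = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(R, full_matrices=False)

k = 2                              # number of latent factors to keep
user_factors = U[:, :k] * s[:k]    # each row: one user's coordinates in latent space
item_factors = Vt[:k].T            # each row: one movie's coordinates in latent space

# Rank-k reconstruction of the rating matrix from the latent coordinates.
R_approx = user_factors @ item_factors.T
print(np.round(R_approx, 1))
```

Users 0 and 1 end up near each other in `user_factors`, as do users 2 and 3, which mirrors the two taste groups visible in the toy data.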

### HybridSVD Model (vs. Original SVD)

https://paperswithcode.com/paper/hybridsvd-when-collaborative-information-is

#### 1. Introduction

**PureSVD** is a well-known SVD-based collaborative filtering method. While it is fast and simple, it has critical limitations, especially when user-item interactions are sparse or when new users/items appear (the cold start problem).  
**HybridSVD** addresses these limitations by incorporating side information (e.g., user attributes or item features) into the SVD framework without sacrificing computational efficiency.

#### 2. Key Differences

| Aspect                     | PureSVD                                      | HybridSVD                                                      |
|---------------------------|----------------------------------------------|----------------------------------------------------------------|
| Uses side information     | ❌ No                                         | ✅ Yes                                                         |
| Cold start support        | ❌ Weak                                       | ✅ Strong (through feature mapping)                            |
| Flexibility of latent space | ❌ Fixed (cosine-based latent space)         | ✅ Adaptive (informed by user/item similarity)                 |
| Complexity                | ✅ Simple and efficient                       | ✅ Slightly more complex, but optimized via Cholesky methods   |
| Folding-in (online update) | ✅ Supported                                  | ✅ Supported, with enhanced latent space                        |
| Parameter tuning          | ✅ Easy (rank truncation)                     | ✅ Easy (same with additional α, β parameters)                 |


#### 3. Advantages of HybridSVD

**Incorporates Side Information**

By modeling auxiliary similarities between users and items, HybridSVD can utilize genres, brands, demographic attributes, etc.

**Addresses the Cold Start Problem**

HybridSVD learns mappings from features to latent vectors, allowing recommendation even when no interaction history exists.

**Retains PureSVD Strengths**

It keeps fast convergence, rank truncation, folding-in, and compatibility with sparse data.
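How side information can enter an SVD-style model may be easier to see in code. What follows is a simplified sketch, not the paper's reference implementation. It assumes item-side information only, and a blended similarity matrix `S = (1 - alpha) * I + alpha * Z` built from made-up genre features. The helper name `item_scores` and the toy data are hypothetical.

```python
import numpy as np

def item_scores(R, Z, alpha, rank):
    """Score all items for every user row of R (hypothetical helper).

    R: (n_users, n_items) interaction matrix.
    Z: (n_items, n_items) symmetric positive semi-definite item similarity.
    alpha: weight of side information; alpha=0 reduces to plain truncated SVD.
    """
    n_items = R.shape[1]
    S = (1.0 - alpha) * np.eye(n_items) + alpha * Z   # assumed blending scheme
    L = np.linalg.cholesky(S)                         # S = L @ L.T
    _, _, Vt = np.linalg.svd(R @ L, full_matrices=False)
    V_hat = Vt[:rank].T                               # top right singular vectors of R @ L
    V_back = np.linalg.solve(L.T, V_hat)              # L^{-T} @ V_hat
    return R @ L @ V_hat @ V_back.T                   # folding-in style scores

# Toy interactions: 5 users x 4 items.
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
], dtype=float)

# Made-up item features (two genres); cosine similarity between items gives Z.
F = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
F_unit = F / np.linalg.norm(F, axis=1, keepdims=True)
Z = F_unit @ F_unit.T

scores = item_scores(R, Z, alpha=0.5, rank=2)
print(np.round(scores, 2))
```

With `alpha=0` the similarity matrix is the identity and the function returns the same scores as a truncated SVD of `R`, which is one way to sanity-check the sketch.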


#### 4. When to Use HybridSVD

- The dataset is sparse (e.g., e-commerce with many items but few purchases)
- Cold start items or users are common
- Side information is rich and reliable
- Real-time recommendation is required

-----------------------------------

### Analysis Dataset

In [5]:
import pandas as pd
from gensim.models import Word2Vec
from mlxtend.frequent_patterns import apriori

df = pd.read_csv('basket_analysis.csv')

df.head()

Unnamed: 0.1,Unnamed: 0,Apple,Bread,Butter,Cheese,Corn,Dill,Eggs,Ice cream,Kidney Beans,Milk,Nutmeg,Onion,Sugar,Unicorn,Yogurt,chocolate
0,0,False,True,False,False,True,True,False,True,False,False,False,False,True,False,True,True
1,1,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
2,2,True,False,True,False,False,True,False,True,False,True,False,False,False,False,True,True
3,3,False,False,True,True,False,True,False,False,False,True,True,True,False,False,False,False
4,4,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   Unnamed: 0    999 non-null    int64
 1   Apple         999 non-null    bool 
 2   Bread         999 non-null    bool 
 3   Butter        999 non-null    bool 
 4   Cheese        999 non-null    bool 
 5   Corn          999 non-null    bool 
 6   Dill          999 non-null    bool 
 7   Eggs          999 non-null    bool 
 8   Ice cream     999 non-null    bool 
 9   Kidney Beans  999 non-null    bool 
 10  Milk          999 non-null    bool 
 11  Nutmeg        999 non-null    bool 
 12  Onion         999 non-null    bool 
 13  Sugar         999 non-null    bool 
 14  Unicorn       999 non-null    bool 
 15  Yogurt        999 non-null    bool 
 16  chocolate     999 non-null    bool 
dtypes: bool(16), int64(1)
memory usage: 23.5 KB


- Your DataFrame has 999 rows (transactions) and 17 columns.

- The first column, Unnamed: 0, is an integer index column.

- The other 16 columns represent items (e.g., Apple, Bread, Butter, etc.), and their values are of boolean type (True or False), indicating whether each item was present in the transaction.

- There are no missing values in any column.
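The stray `Unnamed: 0` column usually appears when a DataFrame is saved with its index and read back without declaring it. A minimal sketch of avoiding it at load time follows, assuming the first CSV column really is a saved index; the inline CSV text is a stand-in for `basket_analysis.csv`.

```python
import io
import pandas as pd

# Stand-in for basket_analysis.csv: the first, unnamed column is a saved index.
csv_text = """,Apple,Bread,Butter
0,False,True,False
1,False,False,False
2,True,False,True
"""

# index_col=0 tells pandas to use the first column as the row index
# instead of exposing it as a data column named "Unnamed: 0".
baskets = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(baskets.dtypes)
```

Only the item columns remain, so the frame can be passed to an itemset miner without dropping anything first.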

In [7]:
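# Count, per column, the values that are not binary (0/1 or True/False).
print(df.apply(lambda col: col.map(lambda x: x not in [0, 1, True, False])).sum())
# Collapse counts of 2-4 to 1; the item columns are already boolean, so only the index column changes.
df = df.replace({2: 1, 3: 1, 4: 1})
# Mine itemsets that occur in at least 15% of transactions, ignoring the index column.
frequent_itemsets = apriori(df.drop(columns=['Unnamed: 0'], errors='ignore'), min_support=0.15, use_colnames=True)
print(frequent_itemsets.head())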

Unnamed: 0      997
Apple             0
Bread             0
Butter            0
Cheese            0
Corn              0
Dill              0
Eggs              0
Ice cream         0
Kidney Beans      0
Milk              0
Nutmeg            0
Onion             0
Sugar             0
Unicorn           0
Yogurt            0
chocolate         0
dtype: int64
    support  itemsets
0  0.383383   (Apple)
1  0.384384   (Bread)
2  0.420420  (Butter)
3  0.404404  (Cheese)
4  0.407407    (Corn)


- The first output counts, for each column, how many values are not 0, 1, True, or False.

- Unnamed: 0 has 997 such values (every row except the two whose index is 0 or 1), which confirms it is just an index column and not useful for analysis.

- All other items have 0, meaning their values are clean (only 0, 1, True, or False).

- The second output shows the first five rows returned by the apriori function: each itemset with its support, i.e., the fraction of all transactions that contain it.

- For example: (Apple) with support 0.383383 means Apple appears in about 38.3% of all transactions.
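Support can also be computed by hand, which makes the definition concrete. The sketch below uses a made-up four-transaction table rather than the real dataset. The confidence calculation at the end is an extra illustration that does not appear in the original analysis.

```python
import pandas as pd

# Made-up one-hot basket table: one row per transaction, one column per item.
baskets = pd.DataFrame({
    "Apple":  [True, False, True, True],
    "Bread":  [True, True, False, True],
    "Butter": [False, True, False, True],
})

# Support of a single item = fraction of transactions that contain it.
item_support = baskets.mean()

# Support of a pair = fraction of transactions that contain both items.
pair_support = (baskets["Apple"] & baskets["Bread"]).mean()

# Confidence of the rule Apple -> Bread = support(Apple, Bread) / support(Apple).
confidence = pair_support / item_support["Apple"]

print(item_support)
print(pair_support, confidence)
```

Here Apple appears in 3 of 4 transactions (support 0.75), and Apple and Bread appear together in 2 of 4 (support 0.5).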


-----------------------------------

### Hybrid SVD Model 

In [8]:
# pip install scikit-surprise
# Note: surprise.SVD is a plain matrix-factorization model; it does not use side information.
from surprise import SVD, Dataset, Reader
import pandas as pd

# Load an example DataFrame
df = pd.read_csv('your_ratings.csv')  # requires user, item, rating columns

reader = Reader(rating_scale=(df['rating'].min(), df['rating'].max()))
data = Dataset.load_from_df(df[['user', 'item', 'rating']], reader)

trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

ModuleNotFoundError: No module named 'surprise'
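The cell above expects a long table with user, item, and rating columns, whereas the basket data is a wide boolean table. One possible conversion is sketched below. It assumes each transaction is treated as a "user" and that presence maps to a rating of 1.0 and absence to 0.0; the toy frame and the column names are illustrative.

```python
import pandas as pd

# Made-up wide basket table: one row per transaction, one boolean column per item.
baskets = pd.DataFrame({
    "Apple": [True, False, True],
    "Bread": [True, True, False],
})

# Reshape to one row per (transaction, item) pair.
long_df = (
    baskets.rename_axis("user")   # name the row index so it becomes a "user" column
    .reset_index()
    .melt(id_vars="user", var_name="item", value_name="rating")
)

# Presence/absence as a numeric rating (assumption: True -> 1.0, False -> 0.0).
long_df["rating"] = long_df["rating"].astype(float)
print(long_df)
```

A frame shaped like `long_df` has the three columns that `Dataset.load_from_df` is given in the cell above. Whether 0/1 presence is a meaningful "rating" for that model is a separate modeling question.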