### Collaborative filtering

Collaborative filtering is one of the most widely used core algorithms in recommendation systems. It provides personalized recommendations by predicting a user's interests from preference information (such as likes and behavioral data) collected from many users. The method rests on the assumption that "if people with similar tastes to mine like something, I am also likely to like it."


### Item-based Collaborative Filtering (ICF)

Suppose user **A** has watched the movies "E.T." and "Indiana Jones."

- Traditional item-based collaborative filtering (ICF) recommends movies to **A** that are similar to the ones **A** has already watched, where two movies count as similar when many of the same users have watched both.
- In other words, recommendations are based solely on *co-viewing patterns*:
  > "People who watched these movies also watched..."
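The co-viewing idea can be illustrated with a minimal sketch. The viewing matrix, the extra titles, and the helper names `item_cosine_similarity` and `recommend` are all made up for illustration; the snippet only assumes NumPy.

```python
import numpy as np

# Toy binary "watched" matrix: rows are users, columns are movies.
# The viewing data is invented for illustration only.
movies = ["E.T.", "Indiana Jones", "Jaws", "Titanic"]
watched = np.array([
    [1, 1, 1, 0],  # user 1
    [1, 1, 1, 0],  # user 2
    [1, 0, 0, 1],  # user 3
    [0, 1, 1, 0],  # user 4
])

def item_cosine_similarity(matrix):
    """Cosine similarity between the columns (items) of a user-item matrix."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for items nobody watched
    normalized = matrix / norms
    return normalized.T @ normalized

def recommend(history, matrix, names, top_n=1):
    """Rank unseen items by their summed similarity to the items in the history."""
    sim = item_cosine_similarity(matrix.astype(float))
    seen = [names.index(title) for title in history]
    scores = sim[:, seen].sum(axis=1)
    scores[seen] = -np.inf  # never recommend something already watched
    best = np.argsort(scores)[::-1][:top_n]
    return [names[i] for i in best]

print(recommend(["E.T.", "Indiana Jones"], watched, movies))
```

In this toy data, "Jaws" is watched by the same users as "Indiana Jones" and by most of the users who watched "E.T.", so it ranks above "Titanic" for user **A**.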


### Relational Collaborative Filtering (RCF)

**RCF (Relational Collaborative Filtering)** is a recommender system framework that leverages multiple item relations, going beyond the traditional collaborative filtering approach.

RCF doesn't just look at co-viewing. It considers various explicit relationships between items (movies, songs, etc.), such as:

- **Same director:** "E.T." and "Schindler's List" are both directed by Steven Spielberg.
- **Same genre:** "E.T." and "The Avengers" are both science fiction.
- **Same actor:** "E.T." and another movie might share an actor.




#### Movie Recommendation Example



##### Scenario

User **B** has watched "E.T."  
There are several possible relations between "E.T." and other movies:

- **Director:** "E.T." and "Schindler's List" (both by Spielberg)
- **Genre:** "E.T." and "The Avengers" (both science fiction)
- **Actor:** "E.T." and another movie with the same actor

##### How RCF Recommends

1. **Classify by relation type:** Identify all possible relations (director, genre, actor, etc.) between "E.T." and candidate movies.
2. **First-level attention:** Determine which relation types matter most to user **B**.
   - For example, if **B** cares more about genre, movies with the same genre as "E.T." get higher weight.
3. **Second-level attention:** Within each relation type, assess which specific values (e.g., "science fiction" vs. "action") or which directors/actors are most relevant to **B**.
   - If **B** especially likes science fiction, those movies are prioritized.
4. **Final recommendation:** Combine **B**'s interaction history and their relation-based preferences to recommend, for example, "The Avengers" (same genre) or "Schindler's List" (same director).
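The two attention levels above can be sketched with plain softmax weights. This is a simplified illustration, not the learned attention of an actual RCF model: the affinity scores are hand-picked, and names such as `type_scores`, `value_scores`, and `relation_score` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical affinity scores for user B (hand-picked, not learned).
relation_types = ["genre", "director", "actor"]
type_scores = np.array([2.0, 1.0, 0.0])  # first level: which relation type matters

value_scores = {                          # second level: which value within a type
    "genre": {"science fiction": 2.0, "drama": 0.5},
    "director": {"Steven Spielberg": 1.0},
    "actor": {"Drew Barrymore": 0.0},
}

# Relations that link "E.T." (already watched) to each candidate movie.
candidates = {
    "The Avengers": [("genre", "science fiction")],
    "Schindler's List": [("director", "Steven Spielberg")],
}

# First-level attention: normalize the relation-type scores.
type_weights = dict(zip(relation_types, softmax(type_scores)))

# Second-level attention: normalize the value scores inside each relation type.
value_weights = {}
for rel_type, values in value_scores.items():
    probs = softmax(np.array(list(values.values())))
    value_weights[rel_type] = dict(zip(values.keys(), probs))

def relation_score(links):
    """Sum of (type weight * value weight) over the relations linking two movies."""
    return sum(type_weights[t] * value_weights[t][v] for t, v in links)

scores = {movie: relation_score(links) for movie, links in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

Because **B** is assumed here to weight genre most heavily and to favor science fiction within it, "The Avengers" scores higher than "Schindler's List". In a trained model these weights would be learned from the interaction data and combined with the collaborative signal rather than set by hand.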

##### Example Recommendation Explanations

- Recommending "The Avengers":
  > "Recommended because both 'E.T.' and 'The Avengers' are science fiction movies you like."
- Recommending "Schindler's List":
  > "Recommended because you like movies directed by Steven Spielberg."

RCF thus provides both personalized recommendations and clear explanations based on diverse item relations.

### Summary Table: ICF vs. RCF

| Aspect                | Traditional ICF                              | RCF (Relational Collaborative Filtering)                |
|-----------------------|----------------------------------------------|--------------------------------------------------------|
| Main signal           | Co-viewed/co-purchased items                 | Multiple explicit item relations (director, genre, etc.)|
| Personalization       | Based on similar users or items              | Based on user-specific relation preferences            |
| Explanation           | Limited ("People also watched...")           | Rich ("Same director/genre as movies you liked")       |
| Example               | "People who watched X also watched Y"        | "Recommended because both are sci-fi movies"           |



---

#### What Data Is Needed?

To implement RCF (Relational Collaborative Filtering), you need not only the traditional recommender system data (user-item interactions), but also a structured representation of various item-to-item relations.

1. User-item interaction matrix

| UserID | ItemID | Rating/Interaction |
|--------|--------|--------------------|
| U1     | Movie1 | 5                  |
| U1     | Movie3 | 4                  |
| U2     | Movie1 | 3                  |
| U2     | Movie2 | 5                  |

2. Item metadata

| ItemID | Director    | Genre  | Actor1           | Actor2         | Year |
|--------|-------------|--------|------------------|----------------|------|
| Movie1 | Spielberg   | Sci-Fi | Henry Thomas     | Drew Barrymore | 1982 |
| Movie2 | Spielberg   | Drama  | Liam Neeson      | Ralph Fiennes  | 1993 |
| Movie3 | Joss Whedon | Sci-Fi | Robert Downey Jr | Chris Evans    | 2012 |
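One possible way to turn item metadata into explicit item-to-item relations is sketched below. The output layout and the function name `build_relations` are assumptions made for illustration; only the director and genre columns of the example table are used.

```python
import pandas as pd
from itertools import combinations

# Item metadata mirroring the example table (illustrative values).
items = pd.DataFrame({
    "ItemID": ["Movie1", "Movie2", "Movie3"],
    "Director": ["Spielberg", "Spielberg", "Joss Whedon"],
    "Genre": ["Sci-Fi", "Drama", "Sci-Fi"],
})

def build_relations(df, relation_columns):
    """Return one row per item pair that shares a value in a relation column."""
    rows = []
    for column in relation_columns:
        for value, group in df.groupby(column):
            # Every pair of items inside the same group shares this value.
            for a, b in combinations(sorted(group["ItemID"]), 2):
                rows.append((a, b, column.lower(), value))
    return pd.DataFrame(rows, columns=["item_a", "item_b", "relation", "value"])

relations_df = build_relations(items, ["Director", "Genre"])
print(relations_df)
```

For this table the result contains two relations: Movie1 and Movie2 share a director, and Movie1 and Movie3 share a genre.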

---

#### Data Analysis

- `df_ratings`: user-item interactions (the core data)
- `df_movies`: movie metadata (title, genres)
- `df_links`: external identifiers (IMDb, TMDb)
- `df_tags`: user-generated tags

In [1]:
import pandas as pd

df_links = pd.read_csv('links.csv')
df_links.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [2]:
df_movies = pd.read_csv('movies.csv')
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
df_ratings = pd.read_csv('ratings.csv')
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [4]:
df_tags = pd.read_csv('tags.csv')
df_tags.head()

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200


In [5]:
df_ratings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100836 entries, 0 to 100835
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   userId     100836 non-null  int64  
 1   movieId    100836 non-null  int64  
 2   rating     100836 non-null  float64
 3   timestamp  100836 non-null  int64  
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


Data structure:

- Total rows: 100,836 rating records (about 100K)
- Total columns: 4
- Memory usage: 3.1 MB
- No missing values: every column has 100,836 non-null entries
- Sufficient data volume: about 100K ratings available for training
- Temporal information: the `timestamp` column enables temporal pattern analysis
- Consistent types: IDs and timestamps are `int64`, ratings are `float64`


In [None]:
print("=== Basic Statistics ===")
print(f"Total number of users: {df_ratings['userId'].nunique():,}")
print(f"Total number of movies: {df_ratings['movieId'].nunique():,}")
print(f"Average rating: {df_ratings['rating'].mean():.2f}")
print(f"Rating distribution:\n{df_ratings['rating'].value_counts().sort_index()}")
user_rating_counts = df_ratings.groupby('userId').size()
print(f"\nUser rating count distribution:")
print(f"Average: {user_rating_counts.mean():.1f} ratings")
print(f"Median: {user_rating_counts.median():.1f} ratings")
print(f"Maximum: {user_rating_counts.max()} ratings")

=== Basic Statistics ===
Total number of users: 610
Total number of movies: 9,724
Average rating: 3.50
Rating distribution:
rating
0.5     1370
1.0     2811
1.5     1791
2.0     7551
2.5     5550
3.0    20047
3.5    13136
4.0    26818
4.5     8551
5.0    13211
Name: count, dtype: int64

User rating count distribution:
Average: 165.3 ratings
Median: 70.5 ratings
Maximum: 2698 ratings


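- With 100,836 ratings from 610 users on 9,724 movies, the dataset is large enough to train and evaluate simple collaborative filtering models.

- The rating distribution is skewed toward positive values (the mean is 3.50, and 4.0 and 3.0 are the most common ratings), so the model sees fewer examples of negative preferences than of positive ones.

- User activity is long-tailed (median 70.5 vs. mean 165.3 ratings per user, maximum 2,698), so the data contains both heavy and light users for testing personalization and sparse-history strategies.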
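A quick arithmetic check of how sparse the user-item matrix is, using only the counts printed above (the variable names are illustrative):

```python
# Counts taken from the df_ratings.info() and basic-statistics output.
n_ratings = 100_836
n_users = 610
n_movies = 9_724

possible_pairs = n_users * n_movies  # every user-movie combination
density = n_ratings / possible_pairs
sparsity = 1.0 - density

print(f"Possible user-movie pairs: {possible_pairs:,}")
print(f"Density:  {density:.2%}")
print(f"Sparsity: {sparsity:.2%}")
```

Only about 1.7% of all user-movie pairs carry a rating, which matters for any step that fills the remaining entries with a default value.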

---

### ICF

In [None]:
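import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df_ratings = pd.read_csv('ratings.csv')

# User x movie rating matrix; unrated entries are filled with 0.
user_item = df_ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)

# Item-item cosine similarity computed over each movie's rating vector.
item_user = user_item.T
item_sim = pd.DataFrame(
    cosine_similarity(item_user),
    index=item_user.index,
    columns=item_user.index
)

def icf_predict(user, item, user_item, item_sim, k=20):
    """Predict a rating as the similarity-weighted mean of the user's k most similar rated items."""
    user_ratings = user_item.loc[user]
    rated_items = user_ratings[user_ratings > 0].index
    # Fallback for unknown items or users without ratings.
    # Note: user_ratings.mean() also averages the zero-filled unrated entries,
    # so it is far below the user's typical rating.
    if item not in item_sim.index or len(rated_items) == 0:
        return user_ratings.mean() if user_ratings.mean() > 0 else 3.0
    # Keep the k rated items that are most similar to the target item.
    sims = item_sim.loc[item, rated_items]
    top = sims.nlargest(k)
    if top.sum() == 0:
        return user_ratings.mean() if user_ratings.mean() > 0 else 3.0
    pred = np.dot(top, user_ratings[top.index]) / np.abs(top).sum()
    # Clip to the valid rating range.
    return np.clip(pred, 0.5, 5.0)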





#===============================================================================
# Evaluation: refit the similarity matrix on the training split only.

train, test = train_test_split(df_ratings, test_size=0.2, random_state=42)
train_ui = train.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
item_user = train_ui.T
item_sim = pd.DataFrame(
    cosine_similarity(item_user),
    index=item_user.index,
    columns=item_user.index
)

# Score a random sample of 1,000 held-out ratings.
test_sample = test.sample(n=1000, random_state=42)
preds = []
trues = []
for _, row in test_sample.iterrows():
    u, i, r = row['userId'], row['movieId'], row['rating']
    if u in train_ui.index:  # skip users that have no training ratings
        preds.append(icf_predict(u, i, train_ui, item_sim, k=20))
        trues.append(r)
print('RMSE:', np.sqrt(mean_squared_error(trues, preds)))











#===============================================================================
# Top-N recommendation for a single user.

print('  ')
print("=== Recommend to USER_1 ===")
print('  ')

user_id = 1
user_rated = set(train_ui.loc[user_id][train_ui.loc[user_id] > 0].index)
all_items = set(train_ui.columns)
unrated_items = list(all_items - user_rated)

# Predict a score for every movie the user has not rated in the training split.
pred_scores = []
for item in unrated_items:
    score = icf_predict(user_id, item, train_ui, item_sim, k=20)
    pred_scores.append((item, score))

top_n = 10
top_recs = sorted(pred_scores, key=lambda x: x[1], reverse=True)[:top_n]

df_movies = pd.read_csv('movies.csv')
for movie_id, score in top_recs:
    title = df_movies[df_movies['movieId'] == movie_id]['title'].values[0]
    print(f"{title} (movieId={movie_id}): predicted rating {score:.2f}")

RMSE: 1.068017549522926
  
=== Recommend to USER_1 ===
  
The Jinx: The Life and Deaths of Robert Durst (2015) (movieId=131724): predicted rating 5.00
Broken English (1996) (movieId=1519): predicted rating 5.00
3 Ninjas: High Noon On Mega Mountain (1998) (movieId=1739): predicted rating 5.00
Come See the Paradise (1990) (movieId=3106): predicted rating 5.00
Circus (2000) (movieId=3899): predicted rating 5.00
Kizumonogatari III: Cold Blood (2017) (movieId=168218): predicted rating 5.00
Jungle Book 2, The (2003) (movieId=6158): predicted rating 5.00
Betting on Zero (2016) (movieId=170907): predicted rating 5.00
Tickling Giants (2017) (movieId=172705): predicted rating 5.00
Tokyo Idols (2017) (movieId=173235): predicted rating 5.00


RMSE: 1.068

RMSE is the square root of the mean squared error, so large misses weigh more heavily than small ones. A value of about 1.07 means that predictions on the sampled test ratings are typically off by roughly one point on the 0.5 to 5.0 rating scale, which is a moderate level of accuracy.
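To make the interpretation of RMSE concrete, here is a small worked example on invented ratings that compares it with the mean absolute error (MAE). The helper names `rmse` and `mae` and the numbers are illustrative and unrelated to the dataset.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error: square the errors, average, take the root."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error: the plain average size of the errors."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))

# Invented ratings: three small misses (0.5) and one large miss (2.0).
y_true = [4.0, 3.0, 5.0, 2.0]
y_pred = [3.5, 3.5, 4.5, 4.0]

print(f"MAE:  {mae(y_true, y_pred):.3f}")
print(f"RMSE: {rmse(y_true, y_pred):.3f}")
```

The single two-point miss pulls the RMSE above the MAE, which is why RMSE is better read as a typical error that emphasizes large misses than as a plain average difference.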

---

### Relational Collaborative Filtering

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from collections import defaultdict
df_ratings = pd.read_csv('ratings.csv')
df_movies = pd.read_csv('movies.csv')


#===============================================================================




def extract_relations(df_movies):
    """Map each relation type to {value: set of movieIds}; only genres are available here."""
    relations = defaultdict(dict)
    for _, row in df_movies.iterrows():
        movie_id = row['movieId']
        # Genres are stored as a pipe-separated string, e.g. "Comedy|Romance".
        genres = row['genres'].split('|') if pd.notnull(row['genres']) else []
        for genre in genres:
            relations['genre'].setdefault(genre, set()).add(movie_id)
    return relations

relations = extract_relations(df_movies)
user_item = df_ratings.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)




#===============================================================================





def rcf_item_similarity(movie_id1, movie_id2, relations):
    """Jaccard similarity between the genre sets of two movies."""
    genres1 = {g for g, ids in relations['genre'].items() if movie_id1 in ids}
    genres2 = {g for g, ids in relations['genre'].items() if movie_id2 in ids}
    if not genres1 or not genres2:
        return 0.0
    return len(genres1 & genres2) / len(genres1 | genres2)


def rcf_predict(user, item, user_item, relations, k=20):
    """Predict a rating as the genre-similarity-weighted mean of the user's k most similar rated movies."""
    user_ratings = user_item.loc[user]
    rated_items = user_ratings[user_ratings > 0].index
    # Collect (similarity, rating) pairs for rated movies that share a genre with the target.
    sims = []
    for rated_item in rated_items:
        sim = rcf_item_similarity(item, rated_item, relations)
        if sim > 0:
            sims.append((sim, user_ratings[rated_item]))
    # Fallback; note that this mean also averages the zero-filled unrated entries.
    if not sims:
        return user_ratings.mean() if user_ratings.mean() > 0 else 3.0
    sims = sorted(sims, key=lambda x: x[0], reverse=True)[:k]
    numerator = sum(sim * rating for sim, rating in sims)
    denominator = sum(abs(sim) for sim, _ in sims)
    if denominator == 0:
        return user_ratings.mean() if user_ratings.mean() > 0 else 3.0
    pred = numerator / denominator
    # Clip to the valid rating range.
    return np.clip(pred, 0.5, 5.0)




#===============================================================================



train, test = train_test_split(df_ratings, test_size=0.2, random_state=42)
train_ui = train.pivot_table(index='userId', columns='movieId', values='rating').fillna(0)
test_sample = test.sample(n=1000, random_state=42)
preds = []
trues = []
for _, row in test_sample.iterrows():
    u, i, r = row['userId'], row['movieId'], row['rating']
    if u in train_ui.index and i in df_movies['movieId'].values:
        preds.append(rcf_predict(u, i, train_ui, relations, k=20))
        trues.append(r)
print('RCF RMSE:', np.sqrt(mean_squared_error(trues, preds)))



print('\n=== RCF Recommend to USER_1 ===\n')
user_id = 1
user_rated = set(train_ui.loc[user_id][train_ui.loc[user_id] > 0].index)
all_items = set(train_ui.columns)
unrated_items = list(all_items - user_rated)
pred_scores = []
for item in unrated_items:
    score = rcf_predict(user_id, item, train_ui, relations, k=20)
    pred_scores.append((item, score))
top_n = 10
top_recs = sorted(pred_scores, key=lambda x: x[1], reverse=True)[:top_n]
for movie_id, score in top_recs:
    title = df_movies[df_movies['movieId'] == movie_id]['title'].values[0]
    print(f"{title} (movieId={movie_id}): predicted rating {score:.2f}")

RCF RMSE: 0.9327581627675675

=== RCF Recommend to USER_1 ===

Now and Then (1995) (movieId=27): predicted rating 4.74
Babe (1995) (movieId=34): predicted rating 4.74
White Balloon, The (Badkonake sefid) (1995) (movieId=80): predicted rating 4.74
Fluke (1995) (movieId=241): predicted rating 4.74
Little Princess, A (1995) (movieId=262): predicted rating 4.74
Secret Garden, The (1993) (movieId=531): predicted rating 4.74
Little Princess, The (1939) (movieId=917): predicted rating 4.74
Old Yeller (1957) (movieId=1012): predicted rating 4.74
Shiloh (1997) (movieId=1547): predicted rating 4.74
My Dog Skip (1999) (movieId=3189): predicted rating 4.74
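The genre similarity computed by `rcf_item_similarity` is a Jaccard index. A standalone worked example, using the genre strings shown in the `df_movies.head()` output and a hypothetical helper `genre_jaccard`:

```python
def genre_jaccard(genres_a, genres_b):
    """Jaccard similarity between two pipe-separated genre strings."""
    set_a, set_b = set(genres_a.split("|")), set(genres_b.split("|"))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

toy_story = "Adventure|Animation|Children|Comedy|Fantasy"
jumanji = "Adventure|Children|Fantasy"
grumpier_old_men = "Comedy|Romance"

print(genre_jaccard(toy_story, jumanji))           # 3 shared genres out of 5 distinct: 0.6
print(genre_jaccard(toy_story, grumpier_old_men))  # 1 shared genre out of 6 distinct
print(genre_jaccard(jumanji, grumpier_old_men))    # no shared genre: 0.0
```

Movies with identical genre sets have identical similarity to every other movie, so they receive the same predicted score for a given user. That may be why all ten recommendations above are tied at 4.74.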


### Result

#### RCF shows a lower RMSE than ICF (0.933 vs. 1.068), indicating higher prediction accuracy on the sampled test ratings.


RCF can provide more accurate and explainable recommendations by leveraging various types of relational information. However, when the relational data is coarse (here, genres only), many movies end up with nearly identical scores: all ten RCF recommendations for user 1 are tied at 4.74.

Adding more relations such as director or actor can improve both the diversity and quality of recommendations. ICF, on the other hand, relies only on rating patterns, so its recommendations are less explainable but can be more diverse.


### Improvement

In real services, incorporating not only genres but also additional relational information such as director, actors, and year into RCF can greatly enhance both the diversity and quality of recommendations.

This approach allows the system to better capture users’ diverse preferences and provides richer and more explainable recommendation results.
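One way to fold several relations into a single item similarity is a weighted sum of per-relation Jaccard scores. The sketch below is only an illustration: the metadata dictionary, the relation weights, and the function names are assumptions, and since the `movies.csv` used above contains only titles and genres, director and actor fields would have to come from another source.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Hypothetical metadata for the three movies used in the examples.
metadata = {
    "E.T.": {
        "genre": {"Sci-Fi"},
        "director": {"Steven Spielberg"},
        "actor": {"Drew Barrymore"},
    },
    "Schindler's List": {
        "genre": {"Drama"},
        "director": {"Steven Spielberg"},
        "actor": {"Liam Neeson", "Ralph Fiennes"},
    },
    "The Avengers": {
        "genre": {"Sci-Fi"},
        "director": {"Joss Whedon"},
        "actor": {"Robert Downey Jr", "Chris Evans"},
    },
}

# Hand-picked relation weights (an assumption); a full RCF model would
# learn user-specific weights instead of sharing one fixed set.
relation_weights = {"genre": 0.5, "director": 0.3, "actor": 0.2}

def multi_relation_similarity(movie_a, movie_b, weights=relation_weights):
    """Weighted average of the per-relation Jaccard similarities."""
    total = sum(weights.values())
    score = sum(
        w * jaccard(metadata[movie_a][rel], metadata[movie_b][rel])
        for rel, w in weights.items()
    )
    return score / total

print(multi_relation_similarity("E.T.", "The Avengers"))      # shares only the genre
print(multi_relation_similarity("E.T.", "Schindler's List"))  # shares only the director
```

With more than one relation, two movies that tie on genre can still be separated by director or cast, which reduces the number of identical scores in the ranking.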



---

Author: https://github.com/Blunf/BI_MP_EDA