# Paper Implementation: An Improved Collaborative Filtering Recommendation Algorithm and Recommendation Strategy

This project is based on the paper “An Improved Collaborative Filtering Recommendation Algorithm and Recommendation Strategy” by Xiaofeng Li and Dong Li. All research rights and intellectual property belong to the original authors under the Creative Commons Attribution License. We, Matteo and Julian, students at the University of Bolzano, have chosen this work as the foundation for a full analysis and software implementation of its proposed methods, in order to both validate and extend its contributions to community‑aware collaborative filtering.

## 1. Introduction

Li & Li (2019) address key limitations of traditional collaborative filtering (CF), namely data sparsity, cold start, and scalability, by integrating overlapping community detection into the CF pipeline. They propose two algorithms, central‑node‑based and k‑faction, to mine user communities from a social‑network projection of user–item interactions. By localizing neighbor selection within these communities and combining rating‑based similarity with category‑based similarity, they report significant reductions in MAE and RMSE on MovieLens‑100K.

## 2. Implementation Roadmap

Below is our high‑level plan to reproduce and extend Li & Li’s community‑aware CF framework:

1. **Dataset Preparation**
   - Download and preprocess the MovieLens 100K dataset.
   - Build the user–item rating matrix.

2. **Community Detection**
   1. **Central‑Node Algorithm**
      - Compute node degrees; seed each community with the highest‑degree node.
      - Iteratively add neighbors that maximize the local contribution \(q\).
      - Merge any two communities whose overlap \(S \ge 0.7\).
   2. **k‑Faction Algorithm**
      - Use Bron–Kerbosch to extract all cliques of size ≥ *k*.
      - Merge cliques based on an overlap threshold \(T\) and inter‑community connectivity.
      - Assign remaining nodes to their closest community; refine by maximizing modularity \(Q_c\).

3. **Community‑Based Collaborative Filtering**
   - For each target user, restrict neighbor search to their detected community.
   - Construct a user–category binary matrix (e.g. item genres or tags).
   - Compute hybrid similarity:
     \[
       \text{sim}(u,v) = (1 - \lambda)\,\text{sim}_R(u,v) + \lambda\,\text{sim}_{\text{cate}}(u,v).
     \]
   - Predict ratings by aggregating the top‑*K* most similar users’ ratings.

4. **Evaluation Framework**
   - Perform 5‑fold cross‑validation, varying the training share of each train:test split from 20 % to 80 %.
   - Measure MAE and RMSE for:
     - **CFCD** (Community‑based CF)
     - **CFC** (Cosine CF)
     - **CFP** (Pearson CF)

5. **Parameter Tuning & Experiments**
   - **Experiment 1:** Fix *K* = 30; vary train:test ratio → assess sparsity impact.
   - **Experiment 2:** Fix training ratio at 80 %; vary *K* → find optimal neighbor set size.

6. **Optimizations & Extensions**
   - Scale community detection to large graphs (e.g. using NetworkX/igraph).
   - Incorporate implicit feedback (timestamps, clicks).
   - Prototype a real‑time recommendation pipeline with incremental updates.
   - Explore deep‑learning–based community embeddings as an alternative to classic detection.

7. **Documentation & Reporting**
   - Write clear API docs and usage examples for each module.
   - Produce reproducible scripts and Jupyter notebooks.
   - Summarize results with tables, charts, and a discussion of future work.
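
The hybrid similarity from item 3 of the roadmap can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: we assume cosine similarity over rating vectors for \(\text{sim}_R\) and Jaccard similarity over binary category vectors for \(\text{sim}_{\text{cate}}\). The function names `cosine_sim`, `jaccard_sim`, and `hybrid_sim` and the toy vectors are hypothetical.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two rating vectors (assumed stand-in for sim_R)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def jaccard_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Jaccard similarity of two binary category vectors (assumed stand-in for sim_cate)."""
    a, b = a.astype(bool), b.astype(bool)
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union > 0 else 0.0


def hybrid_sim(r_u, r_v, c_u, c_v, lam=0.5):
    """sim(u, v) = (1 - lambda) * sim_R(u, v) + lambda * sim_cate(u, v)."""
    return (1 - lam) * cosine_sim(r_u, r_v) + lam * jaccard_sim(c_u, c_v)


# Toy data: ratings on four items (0 = unrated) and three category flags
r_u = np.array([5.0, 3.0, 0.0, 4.0])
r_v = np.array([4.0, 0.0, 0.0, 5.0])
c_u = np.array([1, 1, 0])
c_v = np.array([1, 0, 1])

for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam:.1f}  sim={hybrid_sim(r_u, r_v, c_u, c_v, lam):.3f}")
```

Setting `lam=0.0` reduces the score to pure rating similarity and `lam=1.0` to pure category similarity, which makes \(\lambda\) easy to sanity-check before tuning.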

In [6]:
import pandas as pd
import networkx as nx

## Step 1: Dataset Preparation

In this first step we download and preprocess the MovieLens 100K dataset, originally collected by the GroupLens Research Project at the University of Minnesota. The data consists of 100 000 integer ratings (1–5) from 943 users on 1 682 movies (each user has rated at least 20 titles), along with simple demographic information (age, gender, occupation, ZIP code) and genre labels for each movie. The raw ratings are stored in the tab‑separated file `u.data` (columns: `user_id`, `item_id`, `rating`, `timestamp`), and the summary file `u.info` lists the number of users, items, and ratings. Usage conditions (acknowledgment, non‑commercial use, citation requirements) are documented in the license and citation sections of the accompanying README.

Our code will:
1. Read `u.data` into a DataFrame.
2. Display a sample of raw ratings to verify successful loading.
3. Pivot the user–item ratings into a matrix of shape (943 × 1682), filling missing entries with 0 to prepare for collaborative‑filtering algorithms.


In [4]:
# Step 1: Load and preprocess MovieLens 100K ratings
data_path = './Data/ml-100k/u.data'
columns = ['user_id', 'item_id', 'rating', 'timestamp']

# Read raw ratings file
ratings = pd.read_csv(data_path, sep='\t', names=columns)

# Show a preview of the raw ratings
print("Raw ratings sample:")
display(ratings.head())

# Build user-item rating matrix
rating_matrix = ratings.pivot(index='user_id', columns='item_id', values='rating').fillna(0)

# Display a small part of the rating matrix
print("\nUser-Item Rating Matrix (first 5 users, first 5 items):")
display(rating_matrix.iloc[:5, :5])

Raw ratings sample:


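|   | user_id | item_id | rating | timestamp |
|---|---------|---------|--------|-----------|
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |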



User-Item Rating Matrix (first 5 users, first 5 items):


| user_id / item_id | 1 | 2 | 3 | 4 | 5 |
|-------------------|-----|-----|-----|-----|-----|
| 1 | 5.0 | 3.0 | 4.0 | 3.0 | 3.0 |
| 2 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 4.0 | 3.0 | 0.0 | 0.0 | 0.0 |
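
Community detection operates on a user graph, so the rating matrix has to be projected onto one. The sketch below shows one possible projection and is an assumption on our side, not the rule used in the paper: two users are linked when they have rated at least `min_common` items in common. The helper name `build_user_graph`, the `min_common` threshold, and the toy matrix are all hypothetical.

```python
import networkx as nx
import numpy as np
import pandas as pd


def build_user_graph(rating_matrix: pd.DataFrame, min_common: int = 2) -> nx.Graph:
    """Link users who share at least `min_common` co-rated items (illustrative rule)."""
    rated = (rating_matrix.to_numpy() > 0).astype(int)  # 1 where a rating exists
    common = rated @ rated.T                            # co-rated item counts per user pair
    users = list(rating_matrix.index)

    G = nx.Graph()
    G.add_nodes_from(users)
    # Use the upper triangle only, so each pair is visited once and self-loops are skipped
    rows, cols = np.nonzero(np.triu(common >= min_common, k=1))
    for i, j in zip(rows, cols):
        G.add_edge(users[i], users[j], weight=int(common[i, j]))
    return G


# Toy matrix: users 1 and 2 share two items, users 3 and 4 share two items
toy = pd.DataFrame(
    [[5, 3, 0, 0],
     [4, 2, 0, 0],
     [0, 0, 4, 5],
     [0, 1, 5, 4]],
    index=[1, 2, 3, 4],
    columns=[10, 20, 30, 40],
)
toy_graph = build_user_graph(toy, min_common=2)
print(sorted(toy_graph.edges()))  # [(1, 2), (3, 4)]
```

On the full 943 × 1682 matrix a low threshold would likely produce a very dense graph, so `min_common` would need tuning before the communities become informative.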


## Step 2: Community Detection Algorithms

### Step 2.1: Central‑Node Overlapping Community Detection Algorithm

In this step we extract overlapping user communities from the user–item interaction graph. Each community is seeded with the unlabeled node of highest degree (the “central node”) and then grown by iteratively adding the neighbor that maximizes the local contribution
\[
  q = \frac{L_{in}}{L_{in} + L_{out}},
\]
where \(L_{in}\) (`Lin` in the code) is the number of edges inside the candidate community and \(L_{out}\) (`Lout`) is the number of edges from that community to the rest of the graph. Expansion stops when the best candidate would lower `q` below the highest value found so far, which the code tracks as the global contribution `Q`; ties are accepted. In our implementation only seed nodes are marked as labeled, so every node eventually seeds one community and near‑duplicates are removed in the adjustment stage. There, any two communities whose overlap
\[
  S(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}
\]
is at least 0.7 are merged, repeating until stable. The output is a set of densely connected, overlapping communities to be used as localized neighbor pools in our collaborative‑filtering stage.
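
The local contribution and the overlap measure can be checked by hand on a toy graph. The snippet below is a small sketch for that purpose; the helper names `local_contribution` and `overlap` and the toy graph are ours and hypothetical.

```python
import networkx as nx


def local_contribution(G: nx.Graph, community: set) -> float:
    """q = Lin / (Lin + Lout) for a candidate community."""
    l_in = G.subgraph(community).number_of_edges()
    # Each edge leaving the community is counted once, from its inside endpoint
    l_out = sum(1 for u in community for v in G.neighbors(u) if v not in community)
    total = l_in + l_out
    return l_in / total if total > 0 else 0.0


def overlap(ci: set, cj: set) -> float:
    """S(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj| (Jaccard overlap)."""
    union = ci | cj
    return len(ci & cj) / len(union) if union else 0.0


# Toy graph: a triangle {0, 1, 2} with a path 2-3-4 attached
G_toy = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)])

print(local_contribution(G_toy, {0, 1, 2}))  # 3 internal edges, 1 outgoing edge
print(local_contribution(G_toy, {0, 1}))     # 1 internal edge, 2 outgoing edges
print(overlap({0, 1, 2, 3}, {1, 2, 3, 4}))   # 3 shared nodes out of 5
```

The triangle scores `q = 3 / 4`, while the pair `{0, 1}` scores only `1 / 3`, which is why the greedy expansion prefers to close the triangle. The two example communities overlap with `S = 3 / 5`, below the 0.7 threshold, so they would stay separate.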



In [9]:
def central_node_overlapping_communities(G, overlap_threshold=0.7):
    """
    Central-node based overlapping community detection.
    Args:
      G: networkx Graph (undirected)
      overlap_threshold: float in [0,1], threshold S to merge similar communities
    Returns:
      List of sets, each a community of node IDs
    """
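    seed_labeled = set()  # nodes already used as a seed (community members stay unlabeled)
    communities = []
    # Stage 1: Community Mining
    while len(seed_labeled) < G.number_of_nodes():
        # Seed a new community with the highest-degree node not yet used as a seed
        seeds = [n for n in G.nodes() if n not in seed_labeled]
        seed = max(seeds, key=lambda n: G.degree(n))
        seed_labeled.add(seed)
        C = {seed}
        Q = 0.0  # best local contribution reached so far for this community
        while True:
            # Frontier: nodes adjacent to the community but not yet in it
            U = {v for u in C for v in G.neighbors(u) if v not in C}
            contributions = {}
            for j in U:
                Cj = C | {j}
                # Lin: edges inside the candidate; Lout: edges leaving it
                Lin = G.subgraph(Cj).number_of_edges()
                Lout = sum(1 for u in Cj for v in G.neighbors(u) if v not in Cj)
                qj = Lin / (Lin + Lout) if (Lin + Lout) > 0 else 0
                contributions[j] = qj
            if not contributions:
                break  # no frontier left: the connected component is exhausted
            j_star, q_max = max(contributions.items(), key=lambda item: item[1])
            # Accept the best candidate unless it lowers q (ties are accepted)
            if q_max >= Q:
                C.add(j_star)
                Q = q_max
            else:
                break
        communities.append(C)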

    # Stage 2: Community Adjustment
    merged = True
    while merged:
        merged = False
        new_comms = []
        used = [False] * len(communities)
        for i, Ci in enumerate(communities):
            if used[i]:
                continue
            merged_comm = set(Ci)
            used[i] = True
            for j in range(i+1, len(communities)):
                if used[j]:
                    continue
                Cj = communities[j]
                S = len(merged_comm & Cj) / len(merged_comm | Cj)
                if S >= overlap_threshold:
                    merged_comm |= Cj
                    used[j] = True
                    merged = True
            new_comms.append(merged_comm)
        communities = new_comms

    return communities

# Test on Karate Club
G = nx.karate_club_graph()
comms = central_node_overlapping_communities(G)
print("Detected communities:")
for i, comm in enumerate(comms, 1):
    print(f"Community {i}: {sorted(comm)}")


Detected communities:
Community 1: [2, 8, 9, 14, 15, 18, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
Community 2: [0, 1, 2, 3, 7, 8, 9, 11, 12, 13, 17, 19, 21, 28, 30]
Community 3: [24, 25, 28, 31]
Community 4: [23, 24, 25, 27, 28, 31]
Community 5: [4, 5, 6, 10, 16]


### Step 2.2: k‑Faction Algorithm

In [None]:
pass
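
Until the cell above is filled in, the first step of the roadmap (clique extraction) can be sketched with NetworkX. `nx.find_cliques` enumerates maximal cliques with a Bron–Kerbosch variant, so this sketch keeps only maximal cliques with at least *k* nodes rather than every clique of that size, which is a simplification on our side. The merging and refinement steps are not covered here, and the helper name `cliques_of_min_size` is hypothetical.

```python
import networkx as nx


def cliques_of_min_size(G: nx.Graph, k: int = 3) -> list:
    """Maximal cliques with at least k nodes, largest first."""
    cliques = [set(c) for c in nx.find_cliques(G) if len(c) >= k]
    return sorted(cliques, key=len, reverse=True)


# Toy graph: a triangle {0, 1, 2} and a 4-clique {3, 4, 5, 6} joined by edge 2-3
G_demo = nx.Graph()
G_demo.add_edges_from([(0, 1), (0, 2), (1, 2)])
G_demo.add_edges_from([(3, 4), (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)])
G_demo.add_edge(2, 3)

print([sorted(c) for c in cliques_of_min_size(G_demo, k=3)])
# [[3, 4, 5, 6], [0, 1, 2]]
```

The bridge edge 2-3 is itself a maximal clique of size 2, so it is dropped for `k=3`. Nodes left outside every retained clique are the ones the roadmap assigns to their closest community afterwards.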

## Step 3: Community‑Based Collaborative Filtering