**Problem 1: Recommender System using Collaborative Filtering**

Implement a Movie Recommendation System and run it on the Movie Lens Dataset (Train vs Test). Mesure performance on test set using RMSE



*   First you are required to compute first a user-user similarity based on ratings and movies in common
*   Second, make rating predictions on the test set following the KNN idea: a prediction (user, movie) is the weighted average of other users' rating for the movie, weighted by user-similarity to the given user.

In [None]:
import pandas as pd
import numpy as np
import zipfile
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# 1. Load Movielens 100k data
# Download and extract the dataset, then point to the u.data file
with zipfile.ZipFile('ml-100k.zip','r') as z:
    with z.open('ml-100k/u.data') as f:
        ratings = pd.read_csv(
            f,
            sep='\t',
            names=['user_id','item_id','rating','timestamp'],
            usecols=['user_id','item_id','rating']
        )

# 2. Train/test split
train_df, test_df = train_test_split(ratings, test_size=0.2, random_state=42)

# 3. Build user-item rating matrix from train set
ratings_matrix = train_df.pivot_table(
    index='user_id', columns='item_id', values='rating'
)

# 4. Compute user-user similarity (Pearson)
# We use pandas' corr on transposed matrix: correlations over items axis
sim_matrix = ratings_matrix.T.corr(method='pearson', min_periods=1)

# 5. Prediction function using KNN

def predict_rating(user_id, item_id, ratings_mat, sim_mat, k=5):
    if item_id not in ratings_mat.columns:
        # Movie not seen in training
        return ratings_mat.stack().mean()

    # similarities for target user to all others
    sims = sim_mat[user_id].drop(index=user_id).dropna()
    # ratings of other users for this item
    item_ratings = ratings_mat[item_id].dropna()

    # intersect neighbors
    common_users = sims.index.intersection(item_ratings.index)
    if len(common_users) == 0:
        return ratings_mat.stack().mean()

    # select top-k similar users
    top_k = sims.loc[common_users].abs().nlargest(k).index
    top_sims = sims.loc[top_k]
    top_ratings = item_ratings.loc[top_k]

    # weighted average
    num = (top_sims * top_ratings).sum()
    den = top_sims.abs().sum()
    if den == 0:
        return ratings_mat.stack().mean()
    return num / den

# 6. Predict for test set and evaluate RMSE
preds = []
truths = []
for _, row in test_df.iterrows():
    u, i, r = row['user_id'], row['item_id'], row['rating']
    pred = predict_rating(u, i, ratings_matrix, sim_matrix, k=5)
    preds.append(pred)
    truths.append(r)

rmse = np.sqrt(mean_squared_error(truths, preds))
print(f"Test RMSE (user-user CF, k=5): {rmse:.4f}")

Test RMSE (user-user CF, k=5): 2.4288


**Problem 3A: Social Community Detection**

Implement edge-removal community detection algorithm on the Flicker Graph. Use the betweeness idea on edges and the Girvan–Newman Algorithm. The original dataset graph has more than 5M edges; in DM_resources there are 4 different sub-sampled graphs with edge counts from 2K to 600K; you can use these if the original is too big.
You should use a library to support graph operations (edges, vertices, paths, degrees, etc). We used igraph in python which also have builtin community detection algorithms (not allowed); these are useful as a way to evaluate communities you obtain

In [None]:
# install igraph’s C core
!apt-get update -qq
!apt-get install -y libigraph0-dev

# install the Python bindings
!pip install python-igraph

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Package libigraph0-dev is not available, but is referred to by another package.
This may mean that the package is missing, has been obsoleted, or
is only available from another source

E: Package 'libigraph0-dev' has no installation candidate


SystemExit: This module defines load_graph() and girvan_newman(); import and call them interactively.

**Problem 3B: Social Community Detection**

Implement the modularity detection algorithm on this artificial graph (adj matrix written in sparse format : each row is an edge [node_id, node_id, 1]). You will need to compute the modularity matrix B and its highest-val eigenvector V1. The split vector S (+1 / -1) aligns by sign with V1; follow this paper. Partition the graph in two parts K=2).
Optional: Partition the graph in more than 2 parts, try to figure out what is a "natural" K here.

**Problem 4: Knowledge Base Question Answering**

Given is knowledge graph with entities and relations, questions with starting entity and answers, and their word embedding . For each question, navigate the graph from the start entity outwards until you find appropriate answer entities.
Utils functions (similarity, load_graphs) are given, but you can choose not to use them. This python file contains the helper functions for this homework, the only update needed to use this file is to fill in the file paths.

- The number of correct answers varies (could be 1, could many), use F1 to compare your answers with the given correct answers
- Utils functions (similarity, load_graphs) are given, but you can choose not to use them.
- Answers are given to be used for evaluation only, DO NOT USE ANSWERS IN YOUR GRAPH TRAVERSAL.
Your strategy should be a graph traversal augmented with scoring of paths; you might discard paths not promising along the way. This is similar to a focused crawl strategy. You will take a query (question) that you are trying to answer, it will have a starting entity. Begin your traversal at that starting entity, and look at all adjacent edges. Use get_rel_score_word2vecbase to get a similarity score for each edge, and traverse the edges that are promising. This part is up to you, you can cut off scores below a certain threshold, or take only the top percentage, or weight it based on the average.

There are many valid strategies. You will continue to traverse a path, until the score starts to decrease, or you notice the similarity score drops significantly (compared to the previous edges). Overall try a few different approaches, and choose one that gives you the best overall F1 score.

In [1]:
!pip install gensim
!pip install numpy==1.26.4
!pip install nltk



In [12]:
from collections import defaultdict

from gensim import models
from nltk import word_tokenize
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import json

# File locations
W2V_PATH      = '../content/word2vec_train_dev.dat.txt'
GRAPH_PATH    = '../content/graph.txt'
ANNOT_PATH    = '../content/annotations.txt'

# Beam-search parameters
BEAM_WIDTH  = 5     # how many partial paths to keep at each depth
MAX_DEPTH   = 3     # maximum number of hops to follow
SCORE_THRESH = 0.01  # discard any edge with similarity below this

# Function to get the cosine similarity between a relation and query
word2vec_model = models.Word2Vec.load(W2V_PATH)

def get_rel_score_word2vecbase(rel: str, query: str) -> float:
    """
    Get score for query and relation. Used to inform exploration of knowledge graph.

    :param rel: relation, or edge in knowledge graph
    :param query: query, question to answer
    :return: float score similarity between question and relation
    """
    # Relation not in embedding vocabulary
    if rel not in word2vec_model.wv:
        return 0.0
    # Relation must start with ns:
    rel = 'ns:' + rel if not rel[:3] == 'ns:' else rel

    words = word_tokenize(query.lower())
    w_embs = []
    for w in words:
        if w in word2vec_model.wv:
            w_embs.append(word2vec_model.wv[w])
    return np.mean(cosine_similarity(w_embs, [word2vec_model.wv[rel]]))


def load_node_label_lookup(filepath: str) -> dict:
    """

    Load the lookup dictionary for nodes from the provided json file.

    Args:
        filepath: Path to the json file containing the lookup dictionary.

    Returns: Dictionary of node ids to text description of node.

    """
    with open(filepath, 'rb') as fp:
        return json.load(fp)


def load_query_df(filepath: str) -> pd.DataFrame:
    """

    Load a simplified dataframe of queries. Generated from the original queries nested dictionary, this simplified
    version contains all necessary information for performing the graph traversal testing without all the extra
    information and difficult formatting. Simply loop through this dataframe row by row, start at the start node with
    the query for that row, and the expected answers are given in that same row.

    Args:
        filepath: Path to the provided parquet file

    Returns: Dataframe of queries to perform on the graph.

    """
    return pd.read_parquet(filepath)


# Function to load the graph from file
def load_graph() -> dict:
    """

    Load the graph from the given file.

    Returns: Graph, in form of node_id key, and nested list value. Nested list is adjacency list, with each list
    containing the relation, and destination node_id.

    """
    # Preparing the graph
    graph = defaultdict(list)
    for line in open(GRAPH_PATH):
        line = eval(line[:-1])
        graph[line[0]].append([line[1], line[2]])
    return graph


# Function to load the queries from file
# Preparing the queries
def load_queries() -> list:
    """

    Load the original queries file. This format can be extremely confusing, for a simplified format use load_query_df.

    Returns: Nested list, with index, node_id, relation types for answers, text description of start node, and dict of
    answers.

    """
    queries = []
    for line in open(ANNOT_PATH):
        line = eval(line[:-1])
        queries.append(line)
    return queries

# Beam search
def answer_query(start: str,
                 query: str,
                 graph: dict,
                 beam_width: int  = BEAM_WIDTH,
                 max_depth:  int  = MAX_DEPTH,
                 score_thresh: float = SCORE_THRESH
                ) -> set:
    """
    Beam‐search out from `start`. At each hop:
      - score each outgoing edge by similarity to `query`
      - prune those below `score_thresh`
      - keep only top `beam_width` partial paths
    Returns the set of final node IDs reached.
    """
    # each beam: (current_node, [scores_so_far])
    beams = [(start, [])]

    for _ in range(max_depth):
        new_beams = []
        for node, scores in beams:
            for rel, nbr in graph.get(node, []):
                s = get_rel_score_word2vecbase(rel, query)
                if s < score_thresh:
                    continue
                new_beams.append((nbr, scores + [s]))

        # rank by average score along the path
        new_beams.sort(key=lambda x: np.mean(x[1]), reverse=True)
        beams = new_beams[:beam_width]
        if not beams:
            break

    return {node for node, _ in beams}

# Evaluation
def evaluate_all(graph: dict, queries: list) -> float:
    """
    For each q = [idx, question, start, ..., answers]:
      * extract gold IDs
      * run answer_query(...)
      * compute binary F1 across the union of gold+pred
    Returns overall F1.
    """
    y_true, y_pred = [], []

    for idx, question, start_id, _, _, answers in queries:
        gold_ids = {a['AnswerArgument'] for a in answers}
        pred_ids = answer_query(start_id, question, graph)

        # build a single test‐vector per query
        labels = list(gold_ids | pred_ids)
        y_t = [1 if lbl in gold_ids else 0 for lbl in labels]
        y_p = [1 if lbl in pred_ids else 0 for lbl in labels]

        y_true.extend(y_t)
        y_pred.extend(y_p)

    return f1_score(y_true, y_pred)

# Main
if __name__ == '__main__':
    print("Loading graph…")
    graph   = load_graph()
    print("Loading queries…")
    queries = load_queries()
    print(f"Running beam‐search with width={BEAM_WIDTH}, depth={MAX_DEPTH}, thresh={SCORE_THRESH}")
    f1 = evaluate_all(graph, queries)
    print(f"\nOverall F1 score: {f1:.4f}")

Loading graph…
Loading queries…
Running beam‐search with width=5, depth=3, thresh=0.01

Overall F1 score: 0.0000
