# IBM Recommendations — **Clean and complete notebook**
This notebook implements **all** parts of the rubric using **only** `user-item-interactions.csv` and functions centralized in `src/`.

**Sections**  
1. EDA (variables with the expected names)  
2. Rank-based (popularity)  
3. User–user Collaborative Filtering  
4. Content-based (TF-IDF on titles)  
5. Matrix Factorization (SVD) + discussion

In [1]:

# Imports and path to src/
import os, sys, pandas as pd, numpy as np
sys.path.append(os.path.abspath(os.path.join('..','src')))

from data import email_mapper
from rank import get_top_article_ids, get_top_articles, get_ranked_article_unique_counts
from utils import get_article_names
from collaborative import (
    create_user_item_matrix, get_top_sorted_users, find_similar_users,
    user_user_recs, user_user_recs_part2, get_user_articles
)
from content import (build_tfidf_from_df, select_optimal_k, make_content_recs)
from matrix_factorization import (fit_svd, get_svd_similar_article_ids)


## Data Loading

In [2]:

# Load the data: adjust the path if yours differs
path = '../data/user-item-interactions.csv'
df = pd.read_csv(path)

# Typical normalizations from the guide notebook
if 'user_id' not in df.columns and 'email' in df.columns:
    df['user_id'] = email_mapper(df)
    del df['email']

# Types
df['article_id'] = df['article_id'].astype(int)

# Quick look at the data
df.head(3)


| Unnamed: 0.1 | Unnamed: 0 | article_id | title | user_id |
|---|---|---|---|---|
| 0 | 0 | 1430 | using pixiedust for fast, flexible, and easier... | 1 |
| 1 | 1 | 1314 | healthcare python streaming application demo | 2 |
| 2 | 2 | 1429 | use deep learning for image classification | 3 |


## Section I — EDA

In [None]:

user_article_interactions = (
    df.groupby('article_id')['user_id'].nunique().sort_values(ascending=False)
)

median_val = float(df.groupby('user_id')['article_id'].nunique().median())
max_views_by_user = int(df.groupby('user_id')['article_id'].nunique().max())
max_views = int(user_article_interactions.max())
most_viewed_article_id = int(user_article_interactions.idxmax())
unique_articles = int(df['article_id'].nunique())
unique_users = int(df['user_id'].nunique())
total_articles = int(df['article_id'].nunique())  # same as unique_articles: only one DataFrame is used

sol_1_dict = {
    'median_val': median_val,
    'max_views_by_user': max_views_by_user,
    'max_views': max_views,
    'most_viewed_article_id': most_viewed_article_id,
    'unique_articles': unique_articles,
    'unique_users': unique_users,
    'total_articles': total_articles
}

print('EDA summary:', sol_1_dict)
user_article_interactions.head(5)


EDA summary: {'median_val': 3.0, 'max_views_by_user': 135, 'max_views': 467, 'most_viewed_article_id': 1330, 'unique_articles': 714, 'unique_users': 5149, 'total_articles': 714}


article_id
1330    467
1429    397
1364    388
1314    345
1398    329
Name: user_id, dtype: int64

## Section II — Rank Based

In [4]:

top10_ids = get_top_article_ids(10, df)
top10_titles = get_top_articles(10, df)

print('Top 10 IDs:', top10_ids)
print('Top 10 titles, first 5 shown (if present in df):', top10_titles[:5])


Top 10 IDs: [1330, 1429, 1364, 1314, 1398, 1431, 1271, 1427, 43, 1160]
Top 10 titles, first 5 shown (if present in df): ['insights from new york car accident reports', 'use deep learning for image classification', 'predicting churn with the spss random tree algorithm', 'healthcare python streaming application demo', 'total population by country']
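
The ranking helpers live in `src/rank.py` and are not shown in this notebook. As a rough sketch of how such a popularity ranking could be computed, the snippet below counts distinct users per article, which is consistent with the ordering printed in Section I. The function name `top_article_ids_sketch` and the toy data are hypothetical, not the project's actual code.

```python
import pandas as pd

def top_article_ids_sketch(n, interactions):
    """Return the n article ids seen by the most distinct users (hypothetical helper)."""
    counts = interactions.groupby("article_id")["user_id"].nunique()
    # Sort by count descending; break ties by article_id so the result is deterministic.
    ranked = counts.reset_index(name="n_users").sort_values(
        ["n_users", "article_id"], ascending=[False, True]
    )
    return ranked["article_id"].head(n).tolist()

# Toy interactions, made up for illustration (user 1 views article 10 twice).
toy = pd.DataFrame({
    "user_id":    [1, 2, 3, 1, 2, 1, 1],
    "article_id": [10, 10, 10, 20, 20, 30, 10],
})
top2 = top_article_ids_sketch(2, toy)
print(top2)  # [10, 20]
```

Counting distinct users rather than raw rows keeps one very active reader from inflating an article's rank.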


## Section III — Collaborative Filtering (user–user)

In [5]:

user_item = create_user_item_matrix(df)

# Pick the user with the most distinct articles viewed to avoid edge cases
inter_per_user = df.groupby('user_id')['article_id'].nunique().sort_values(ascending=False)
some_user = int(inter_per_user.index[0])

neighbors = get_top_sorted_users(some_user, df, user_item).head(5)
recs_ids, recs_titles = user_user_recs(some_user, df, user_item, m=10)
recs_ids_part2 = user_user_recs_part2(some_user, df, user_item, m=10)

neighbors, recs_ids[:5], recs_titles[:5], recs_ids_part2[:5]


(   neighbor_id  similarity  num_interactions
 0           23    0.992593               135
 1           98    0.436935                97
 2         3764    0.436935                97
 3         3697    0.404512               100
 4           49    0.402504               101,
 [20, 40, 51, 57, 101],
 ['working interactively with rstudio and notebooks in dsx',
  'ensemble learning to improve machine learning results',
  'modern machine learning algorithms',
  'transfer learning for flight delay prediction via variational autoencoders',
  'how to choose a project to practice data science'],
 [887, 40, 173, 599, 681])
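
The functions in `src/collaborative.py` are not shown here, so the exact similarity metric is unknown. The sketch below illustrates one common way to do user-user filtering: build a binary user-item matrix, rank the other users by cosine similarity, and collect articles the target user has not seen. All function names and the toy data are hypothetical, and cosine similarity is an assumption.

```python
import numpy as np
import pandas as pd

def user_item_matrix_sketch(interactions):
    """Binary user x article matrix: 1 if the user interacted with the article."""
    pairs = interactions.drop_duplicates(["user_id", "article_id"]).assign(seen=1)
    wide = pairs.pivot(index="user_id", columns="article_id", values="seen")
    return wide.fillna(0).astype(int)

def similar_users_sketch(user_id, user_item):
    """Other users ordered by cosine similarity to user_id (assumed metric)."""
    m = user_item.to_numpy(dtype=float)
    norms = np.linalg.norm(m, axis=1)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = m / norms[:, None]
    sims = unit @ unit[user_item.index.get_loc(user_id)]
    out = pd.Series(sims, index=user_item.index).drop(user_id)
    return out.sort_values(ascending=False, kind="mergesort")

def articles_of(user_id, user_item):
    """Article ids the user has interacted with."""
    row = user_item.loc[user_id]
    return [int(a) for a in row[row > 0].index]

def recommend_sketch(user_id, user_item, m=10):
    """Walk neighbors from most to least similar, collecting unseen articles."""
    seen = set(articles_of(user_id, user_item))
    recs = []
    for neighbor in similar_users_sketch(user_id, user_item).index:
        for article in articles_of(neighbor, user_item):
            if article not in seen and article not in recs:
                recs.append(article)
            if len(recs) >= m:
                return recs
    return recs

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2, 3],
    "article_id": [10, 20, 10, 20, 30, 40],
})
toy_matrix = user_item_matrix_sketch(toy)
neighbors_of_1 = similar_users_sketch(1, toy_matrix).index.tolist()
toy_recs = recommend_sketch(1, toy_matrix, m=2)
print(neighbors_of_1, toy_recs)  # [2, 3] [30, 40]
```

User 2 shares both of user 1's articles and so ranks first. User 3 shares none and only contributes once the closer neighbor has been used up.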

## Section IV — Content Based (TF‑IDF over titles)

In [7]:

# If only 'title' is available, use it; results improve noticeably once real article text is added.
df_articles = df[['article_id', 'title']].drop_duplicates('article_id').reset_index(drop=True)
vect, X = build_tfidf_from_df(df_articles, text_cols=['title'])
example_article_id = int(df_articles['article_id'].iloc[0])
content_recs = make_content_recs(example_article_id, df_articles, m=10, vect=vect, X=X)
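
`build_tfidf_from_df` and `make_content_recs` are defined in `src/content.py` and are not reproduced here. To make the idea concrete, the sketch below computes TF-IDF vectors over titles from scratch, using only the standard library, and ranks the other titles by cosine similarity. It is an illustration under stated assumptions (whitespace tokenization, smoothed idf, hypothetical function names), not the project's implementation.

```python
import math
from collections import Counter

def tfidf_vectors(titles):
    """L2-normalized TF-IDF vectors as dicts {term: weight}, one per title."""
    docs = [t.lower().split() for t in titles]
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, so a term present in every title still gets a positive weight.
    idf = {term: math.log((1 + n) / (1 + c)) + 1 for term, c in doc_freq.items()}
    vectors = []
    for doc in docs:
        vec = {term: count * idf[term] for term, count in Counter(doc).items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({term: w / norm for term, w in vec.items()})
    return vectors

def cosine(a, b):
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())

def content_recs_sketch(idx, titles, m=10):
    """Positions of the m titles most similar to titles[idx]."""
    vecs = tfidf_vectors(titles)
    scores = [(cosine(vecs[idx], v), j) for j, v in enumerate(vecs) if j != idx]
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:m]]

# Toy titles; the first and third appear in the notebook output, the others are made up.
toy_titles = [
    "use deep learning for image classification",
    "deep learning with tensorflow",
    "healthcare python streaming application demo",
    "image classification in python",
]
content_top = content_recs_sketch(0, toy_titles, m=2)
print(content_top)
```

With titles this short, similarity is driven almost entirely by shared words, which is why adding real article text helps.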


## Section V — Matrix Factorization (SVD)

In [8]:

U, s, Vt = fit_svd(user_item, k=min(50, max(2, min(user_item.shape)-1)))
svd_similar_ids = get_svd_similar_article_ids(example_article_id, user_item, Vt, include_similarity=True, m=10)
svd_similar_ids[:5]


[[1203, 0.4470512060640159],
 [486, 0.3968083380241231],
 [1173, 0.3962177219801733],
 [695, 0.3489042358720732],
 [861, 0.33150767088183763]]
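
`fit_svd` and `get_svd_similar_article_ids` come from `src/matrix_factorization.py`, which is not shown. One plausible way to get article-to-article similarities from an SVD is to treat each column of `Vt`, truncated to `k` factors, as an article embedding and compare embeddings by cosine similarity. The sketch below does that with `numpy.linalg.svd`. Scaling by the singular values and the function name are assumptions.

```python
import numpy as np

def svd_similar_items_sketch(matrix, item_idx, k=2, m=10):
    """Rank items by cosine similarity of their k-dimensional latent vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Item embeddings: columns of Vt scaled by the singular values, truncated to k.
    items = (vt[:k, :] * s[:k, None]).T
    norms = np.linalg.norm(items, axis=1)
    norms[norms == 0] = 1.0  # guard against all-zero embeddings
    unit = items / norms[:, None]
    sims = unit @ unit[item_idx]
    order = [j for j in np.argsort(-sims, kind="stable") if j != item_idx]
    return [(int(j), float(sims[j])) for j in order[:m]]

# Toy user x item matrix, made up for illustration: items 0 and 1 are always
# viewed together, and so are items 2 and 3.
toy_matrix = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
], dtype=float)
similar = svd_similar_items_sketch(toy_matrix, item_idx=0, k=2, m=3)
print(similar[0])
```

In the toy matrix, item 1 comes out as the nearest neighbor of item 0 with similarity close to 1, since their interaction columns are identical.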

## Sanity checks

In [9]:

# EDA types
assert isinstance(median_val, float)
assert isinstance(max_views_by_user, int)
assert isinstance(max_views, int)
assert isinstance(most_viewed_article_id, int)
assert isinstance(unique_articles, int)
assert isinstance(unique_users, int)
assert isinstance(total_articles, int)

# Rank-based
assert isinstance(top10_ids, list) and len(top10_ids) > 0

# CF matrix and recommendations
assert user_item.shape[0] > 0 and user_item.shape[1] > 0
assert isinstance(recs_ids, list)
assert isinstance(recs_ids_part2, list)

print('Sanity checks OK ✅')


Sanity checks OK ✅



### Discussion and validation
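- **Popularity** needs no user history, so it is the natural fallback for user *cold-start*; it is not personalized.  
- **CF** personalizes from the articles that similar users viewed; it is sensitive to sparsity, because users with few interactions have unreliable neighbors.  
- **Content** needs only article text (here, titles), so it can recommend articles that have no interactions yet (*item cold-start*).  
- **SVD** captures latent factors; choose `k` via explained variance and/or validation.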
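
As an illustration of the explained-variance criterion for `k`, the sketch below picks the smallest `k` whose leading singular values account for a target share of the total squared singular values. The function name, the 0.9 target and the example values are hypothetical. For an uncentered interaction matrix this share is strictly a fraction of squared Frobenius norm rather than statistical variance.

```python
import numpy as np

def smallest_k_for_variance(singular_values, target=0.9):
    """Smallest k whose leading singular values reach `target` of the total energy."""
    energy = np.asarray(singular_values, dtype=float) ** 2
    cumulative = np.cumsum(energy) / energy.sum()
    # First position where the cumulative share is >= target, as a 1-based count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, len(energy))

# Made-up singular values: squared they are 16, 4, 1, 1 (total 22).
s_example = [4.0, 2.0, 1.0, 1.0]
k_90 = smallest_k_for_variance(s_example, target=0.90)
k_95 = smallest_k_for_variance(s_example, target=0.95)
print(k_90, k_95)  # 2 3
```

A `k` chosen this way is only a starting point; validating recommendation quality on held-out interactions remains the more direct check.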

**Suggested offline validation:** temporal holdout, metrics such as `Recall@K` and `MAP@K`.  
**Online validation:** A/B test with recommendation CTR and dwell time.
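
The offline metrics named above can be sketched in a few lines. The definitions below follow one common convention: average precision is normalized by `min(number of relevant items, K)`. They assume each recommendation list contains no duplicates, and the function names and toy data are hypothetical.

```python
def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)

def average_precision_at_k(recommended, relevant, k):
    """Mean of precision@rank over the ranks where a relevant item appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / min(len(relevant), k)

def map_at_k(recs_by_user, truth_by_user, k):
    """Average of AP@k over all users that have held-out interactions."""
    users = list(truth_by_user)
    total = sum(
        average_precision_at_k(recs_by_user.get(u, []), truth_by_user[u], k)
        for u in users
    )
    return total / len(users)

# Toy example, made up: held-out articles per user vs. recommended lists.
truth = {"u1": [2, 4, 9], "u2": [5]}
recs = {"u1": [1, 2, 3, 4], "u2": [5, 6]}
r_u1 = recall_at_k(recs["u1"], truth["u1"], k=4)
map4 = map_at_k(recs, truth, k=4)
print(round(r_u1, 3), round(map4, 3))  # 0.667 0.667
```

With a temporal holdout, `truth` would hold each user's interactions after the cutoff date and `recs` would be generated from the interactions before it.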