## Seis nuevas variables:

* **user_count:** número de calificaciones realizadas por cada usuario.

* **movie_count:** número de calificaciones recibidas por cada película.

* **avg_rating_by_user:** promedio de las calificaciones dadas por cada usuario.

* **avg_rating_for_movie:** promedio de las calificaciones recibidas por cada película.

* **freq_pair_count:** cantidad de pares frecuentes (con soporte ≥ 20 %) que incluyen la película en la transacción del usuario.

* **freq_pair_support_sum:** suma de los soportes de esos pares frecuentes para la película y el usuario.

## Baseline

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error

# Carga y limpieza igual que antes  
df = pd.read_csv('ratings2comoML.csv').drop('timestamp', axis=1)

# 1) Calcula la media global y su RMSE (baseline)
global_mean = df['rating'].mean()
y_true = df['rating']
y_pred_baseline = np.full_like(y_true, global_mean, dtype=float)
baseline_rmse = np.sqrt(mean_squared_error(y_true, y_pred_baseline))

print(f'RMSE baseline (media global={global_mean:.4f}): {baseline_rmse:.4f}')


RMSE baseline (media global=3.6912): 1.7342
MEJORA relativa sobre baseline: 9.3%


---
# 2

In [86]:
import pandas as pd
from itertools import combinations
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# 1. Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# 2. Usuarios por película
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
total_users = df['userId'].nunique()
min_support = 0.2  # umbral de soporte mínimo

# 3. Generación de ítems frecuentes de tamaño 2 y 3
frequent_pairs = {}
frequent_triples = {}

# Ítems frecuentes de tamaño 2
for m1, m2 in combinations(users_by_movie.keys(), 2):
    inter = users_by_movie[m1] & users_by_movie[m2]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_pairs[frozenset({m1, m2})] = support

# Ítems frecuentes de tamaño 3
for m1, m2, m3 in combinations(users_by_movie.keys(), 3):
    inter = users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_triples[frozenset({m1, m2, m3})] = support

# 4. Función para extraer características basadas en Apriori
def apriori_features(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}

    pair_count = 0
    pair_sum = 0.0
    triple_count = 0
    triple_sum = 0.0

    # Pares frecuentes
    for other in rated:
        pair = frozenset({target, other})
        if pair in frequent_pairs:
            pair_count += 1
            pair_sum += frequent_pairs[pair]

    # Tríos frecuentes
    for combo in combinations(rated, 2):
        triple = frozenset((target,) + combo)
        if triple in frequent_triples:
            triple_count += 1
            triple_sum += frequent_triples[triple]

    return pd.Series({
        'freq_pair_count': pair_count,
        'freq_pair_support_sum': pair_sum,
        'freq_triple_count': triple_count,
        'freq_triple_support_sum': triple_sum
    })

# 5. Aplicar características y preparar dataset final
apriori_feats = df.apply(apriori_features, axis=1)
df_feat = pd.concat([df, apriori_feats], axis=1)
df_final = df_feat.drop(['userId', 'movieId', 'timestamp'], axis=1)

X = df_final.drop('rating', axis=1)
y = df_final['rating']

# 6. Entrenar modelo y calcular RMSE
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))

print(f"RMSE: {rmse:.4f}")


RMSE: 1.4116


In [87]:
df_final

Unnamed: 0,rating,freq_pair_count,freq_pair_support_sum,freq_triple_count,freq_triple_support_sum
0,5,5.0,1.7,10.0,2.9
1,5,6.0,2.4,11.0,3.2
2,5,5.0,1.5,10.0,2.6
3,5,6.0,1.9,10.0,2.9
4,5,5.0,1.9,10.0,3.0
...,...,...,...,...,...
63,5,5.0,2.3,10.0,4.2
64,5,5.0,2.5,10.0,4.2
65,5,5.0,2.0,10.0,3.6
66,1,5.0,2.3,10.0,4.2


En este código se están construyendo cuatro nuevas variables (features) basadas en patrones frecuentes de co-calificación entre películas, calculados con un enfoque al estilo Apriori. Cada una mide, para un usuario y una película objetivo, cuántos y cuánta “fuerza” de asociación tiene con otras películas que ese usuario ya calificó:

* freq_pair_count

Número de pares frecuentes que involucran a la película objetivo y otra película que el usuario ya calificó.

Es decir, cuenta cuántas veces existe al menos una segunda película “other” tal que el par {película_objetivo, other} cumple con el soporte mínimo (min_support) en todo el dataset.

* freq_pair_support_sum

Suma de los valores de soporte de todos esos pares frecuentes.

Cada par frecuente tiene un soporte support = (número_de_usuarios_que_calificaron_ambas) / total_de_usuarios.

Al sumar esos soportes, capturamos no solo cuántos pares hay, sino también cuán “fuertes” o frecuentes son en la comunidad.

* freq_triple_count

Número de tríos frecuentes que involucran a la película objetivo y dos películas distintas que el usuario ya calificó.

Cuenta cuántas combinaciones {película_objetivo, other1, other2} superan el umbral de soporte mínimo en el dataset.

* freq_triple_support_sum

Suma de los valores de soporte de todos esos tríos frecuentes.

Aquí el soporte se calcula como (número_de_usuarios_que_calificaron_las_tres) / total_de_usuarios.

Al agregar estos soportes, capturamos la fuerza de las asociaciones triples.



**Apriori** → encontrar itemsets frecuentes

**Feature crossing** → generar variables (freq_pair_count, freq_pair_support_sum, etc.) a partir de esos itemsets

---
# 3

In [88]:
import pandas as pd
from itertools import combinations
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
import joblib  # para guardar/cargar el modelo

# 1) Carga el dataset histórico
df = pd.read_csv('ratings2comoML.csv')

# 2) Prepara usuarios por película y calcula soportes Apriori
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
total_users = df['userId'].nunique()
min_support = 0.2

frequent_pairs = {}
frequent_triples = {}

# Pares frecuentes
for m1, m2 in combinations(users_by_movie, 2):
    inter = users_by_movie[m1] & users_by_movie[m2]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_pairs[frozenset({m1, m2})] = support

# Tríos frecuentes
for m1, m2, m3 in combinations(users_by_movie, 3):
    inter = users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_triples[frozenset({m1, m2, m3})] = support

# 3) Función para generar las 4 variables Apriori
def make_features(row):
    user = row['userId']
    movie = row['movieId']
    history = set(df[df['userId']==user]['movieId']) - {movie}

    pair_count = pair_sum = 0
    triple_count = triple_sum = 0

    # Pares
    for other in history:
        pair = frozenset({movie, other})
        if pair in frequent_pairs:
            pair_count += 1
            pair_sum += frequent_pairs[pair]

    # Tríos
    for combo in combinations(history, 2):
        triple = frozenset((movie,) + combo)
        if triple in frequent_triples:
            triple_count += 1
            triple_sum += frequent_triples[triple]

    return pd.Series({
        'freq_pair_count': pair_count,
        'freq_pair_support_sum': pair_sum,
        'freq_triple_count': triple_count,
        'freq_triple_support_sum': triple_sum
    })

# 4) Aplica sobre todo el histórico
apriori_feats = df.apply(make_features, axis=1)
df_feat = pd.concat([df, apriori_feats], axis=1)

# 5) Prepara X, y y división train/test
X = df_feat[['freq_pair_count', 'freq_pair_support_sum',
             'freq_triple_count', 'freq_triple_support_sum']]
y = df_feat['rating']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 6) Entrena el modelo
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 7) Evalúa con RMSE
y_pred = model.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))
print(f'RMSE en test: {rmse:.4f}')

# 8) Guarda el modelo para inferencia futura
joblib.dump(model, 'rf_apriori_model.pkl')
print("Modelo guardado en 'rf_apriori_model.pkl'")


RMSE en test: 1.4116
Modelo guardado en 'rf_apriori_model.pkl'


---
# 4

In [89]:
import pandas as pd
from itertools import combinations
from sklearn.ensemble import RandomForestRegressor
import joblib  # para guardar/cargar tu modelo

# --- 1) Reconstruir soportes Apriori (igual que en entrenamiento) ---
df = pd.read_csv('ratings2comoML.csv')
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
total_users = df['userId'].nunique()
min_support = 0.2

# pares y tríos frecuentes
frequent_pairs = {}
frequent_triples = {}

for m1, m2 in combinations(users_by_movie, 2):
    inter = users_by_movie[m1] & users_by_movie[m2]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_pairs[frozenset({m1, m2})] = support

for m1, m2, m3 in combinations(users_by_movie, 3):
    inter = users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_triples[frozenset({m1, m2, m3})] = support

# --- 2) Función para extraer características dado un par (user, movie) ---
def make_features(user_id, movie_id):
    rated = set(df[df['userId']==user_id]['movieId']) - {movie_id}
    pair_count = pair_sum = 0
    triple_count = triple_sum = 0

    # pares
    for other in rated:
        pair = frozenset({movie_id, other})
        if pair in frequent_pairs:
            pair_count += 1
            pair_sum += frequent_pairs[pair]

    # tríos
    for combo in combinations(rated, 2):
        triple = frozenset((movie_id,) + combo)
        if triple in frequent_triples:
            triple_count += 1
            triple_sum += frequent_triples[triple]

    return {
        'freq_pair_count': pair_count,
        'freq_pair_support_sum': pair_sum,
        'freq_triple_count': triple_count,
        'freq_triple_support_sum': triple_sum
    }

# --- 3) Carga tu modelo entrenado ---
model: RandomForestRegressor = joblib.load(r"C:\Users\juanj\Desktop\Primer test Apr Auto\rf_apriori_model.pkl")

# --- 4) Predicción para un nuevo usuario y película ---
new_user = 7
new_movie = 5
feat_dict = make_features(new_user, new_movie)
X_new = pd.DataFrame([feat_dict])

predicted_rating = model.predict(X_new)[0]
print(f"Predicción de rating para user {new_user} y movie {new_movie}: {predicted_rating:.2f}")


Predicción de rating para user 7 y movie 5: 2.91


---
# 5

In [90]:
import pandas as pd
from itertools import combinations
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

# 1) Carga del dataset y generación de características Apriori (como antes)
df = pd.read_csv('ratings2comoML.csv')
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
total_users = df['userId'].nunique()
min_support = 0.2
frequent_pairs, frequent_triples = {}, {}
for m1, m2 in combinations(users_by_movie, 2):
    inter = users_by_movie[m1] & users_by_movie[m2]
    sup = len(inter) / total_users
    if sup >= min_support:
        frequent_pairs[frozenset({m1, m2})] = sup
for m1, m2, m3 in combinations(users_by_movie, 3):
    inter = users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]
    sup = len(inter) / total_users
    if sup >= min_support:
        frequent_triples[frozenset({m1, m2, m3})] = sup

def make_features(row):
    user,row_movie = row['userId'], row['movieId']
    hist = set(df[df['userId']==user]['movieId']) - {row_movie}
    pc = ps = tc = ts = 0
    for other in hist:
        p = frozenset({row_movie, other})
        if p in frequent_pairs:
            pc += 1; ps += frequent_pairs[p]
    for combo in combinations(hist,2):
        t = frozenset((row_movie,)+combo)
        if t in frequent_triples:
            tc += 1; ts += frequent_triples[t]
    return pd.Series({
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    })

feats = df.apply(make_features, axis=1)
df_feat = pd.concat([df, feats], axis=1)
X = df_feat[['freq_pair_count','freq_pair_support_sum',
             'freq_triple_count','freq_triple_support_sum']]
y = df_feat['rating']

# 2) División y entrenamiento
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 3) Seleccionar muestra aleatoria del test
sample = X_test.sample(n=1, random_state=24)
idx = sample.index[0]
X_sample = sample
y_actual = y_test.loc[idx]

# 4) Predicción
predicted = model.predict(X_sample)[0]

# 5) Mostrar resultados
print(f"Índice original: {idx}")
print("Características de entrada:", X_sample.to_dict(orient='records')[0])
print(f"Rating REAL: {y_actual}")
print(f"Rating PREDICHO: {predicted:.4f}")


Índice original: 65
Características de entrada: {'freq_pair_count': 5.0, 'freq_pair_support_sum': 2.0, 'freq_triple_count': 10.0, 'freq_triple_support_sum': 3.5999999999999996}
Rating REAL: 5
Rating PREDICHO: 4.6000


In [91]:
import pandas as pd
from itertools import combinations
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

# 1) Carga del dataset y generación de características Apriori
df = pd.read_csv('ratings2comoML.csv')
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
total_users = df['userId'].nunique()
min_support = 0.2

frequent_pairs, frequent_triples = {}, {}
for m1, m2 in combinations(users_by_movie, 2):
    inter = users_by_movie[m1] & users_by_movie[m2]
    sup = len(inter) / total_users
    if sup >= min_support:
        frequent_pairs[frozenset({m1, m2})] = sup
for m1, m2, m3 in combinations(users_by_movie, 3):
    inter = users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]
    sup = len(inter) / total_users
    if sup >= min_support:
        frequent_triples[frozenset({m1, m2, m3})] = sup

def make_features(row):
    user, movie = row['userId'], row['movieId']
    hist = set(df[df['userId']==user]['movieId']) - {movie}
    pc = ps = tc = ts = 0
    for other in hist:
        p = frozenset({movie, other})
        if p in frequent_pairs:
            pc += 1; ps += frequent_pairs[p]
    for combo in combinations(hist, 2):
        t = frozenset((movie,) + combo)
        if t in frequent_triples:
            tc += 1; ts += frequent_triples[t]
    return pd.Series({
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    })

feats = df.apply(make_features, axis=1)
df_feat = pd.concat([df, feats], axis=1)

# 2) Preparar X, y y división train/test
X = df_feat[['freq_pair_count', 'freq_pair_support_sum',
             'freq_triple_count', 'freq_triple_support_sum']]
y = df_feat['rating']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3) Entrenar el modelo
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# 4) Predecir en el conjunto de prueba
y_pred = model.predict(X_test)

# 5) Calcular RMSE
rmse = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE en test: {rmse:.4f}")


RMSE en test: 1.4116


---
# Mas features

## RandomForestRegressor Geners

In [92]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error

# 1. Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# 2. Usuario-por-película y soportes
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u) / total_users for m, u in users_by_movie.items()}

# 3. Generar ítems frecuentes de tamaño 2 y 3 (soporte ≥ 0.2)
min_support = 0.2
frequent_pairs = {frozenset({m1, m2}): len(users_by_movie[m1] & users_by_movie[m2]) / total_users
                  for m1, m2 in combinations(users_by_movie.keys(), 2)
                  if len(users_by_movie[m1] & users_by_movie[m2]) / total_users >= min_support}
frequent_triples = {frozenset({m1, m2, m3}):
                    len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users
                    for m1, m2, m3 in combinations(users_by_movie.keys(), 3)
                    if len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users >= min_support}

# 4. Extracción de features avanzadas
def apriori_features_ultimate(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}
    
    # Unarios
    sup_target = movie_support.get(target, 0.0)
    cnt_rated = len(rated)
    
    # Inicializar
    pair_supports = []
    pair_leverages = []
    confs = []
    lifts = []
    weighted_ratings = []
    
    triple_supports = []
    triple_leverages = []
    triple_lifts = []
    
    for other in rated:
        pair = frozenset({target, other})
        if pair in frequent_pairs:
            sup = frequent_pairs[pair]
            pair_supports.append(sup)
            t_sup = sup_target
            o_sup = movie_support.get(other, 0)
            pair_leverages.append(sup - t_sup * o_sup)
            if t_sup > 0:
                confs.append(sup / t_sup)
            if t_sup > 0 and o_sup > 0:
                lifts.append(sup / (t_sup * o_sup))
            # Weighted by support
            rating_other = df[(df['userId'] == user) & (df['movieId'] == other)]['rating'].iloc[0]
            weighted_ratings.append(sup * rating_other)
    
    for combo in combinations(rated, 2):
        triple = frozenset((target,) + combo)
        if triple in frequent_triples:
            sup3 = frequent_triples[triple]
            triple_supports.append(sup3)
            t_sup = sup_target
            o1_sup = movie_support.get(combo[0], 0)
            o2_sup = movie_support.get(combo[1], 0)
            triple_leverages.append(sup3 - t_sup * o1_sup * o2_sup)
            if t_sup > 0 and o1_sup > 0 and o2_sup > 0:
                triple_lifts.append(sup3 / (t_sup * o1_sup * o2_sup))
    
    # Features cálculo
    return pd.Series({
        'sup_target': sup_target,
        'cnt_rated': cnt_rated,
        'freq_pair_count': len(pair_supports),
        'freq_pair_support_sum': sum(pair_supports),
        'max_pair_support': max(pair_supports) if pair_supports else 0.0,
        'min_pair_support': min(pair_supports) if pair_supports else 0.0,
        'avg_pair_support': sum(pair_supports)/len(pair_supports) if pair_supports else 0.0,
        'sum_pair_leverage': sum(pair_leverages),
        'max_pair_leverage': max(pair_leverages) if pair_leverages else 0.0,
        'max_pair_confidence': max(confs) if confs else 0.0,
        'avg_pair_lift': sum(lifts)/len(lifts) if lifts else 0.0,
        'max_pair_lift': max(lifts) if lifts else 0.0,
        'weighted_avg_rating_pair': sum(weighted_ratings)/sum(pair_supports) if pair_supports else 0.0,
        'freq_triple_count': len(triple_supports),
        'freq_triple_support_sum': sum(triple_supports),
        'avg_triple_support': sum(triple_supports)/len(triple_supports) if triple_supports else 0.0,
        'max_triple_support': max(triple_supports) if triple_supports else 0.0,
        'sum_triple_leverage': sum(triple_leverages),
        'max_triple_lift': max(triple_lifts) if triple_lifts else 0.0,
        'avg_triple_lift': sum(triple_lifts)/len(triple_lifts) if triple_lifts else 0.0,
        'triple_coverage': len(triple_supports)/ (cnt_rated*(cnt_rated-1)/2) if cnt_rated > 1 else 0.0
    })

# 5. Generar dataset final
ultimate_feats = df.apply(apriori_features_ultimate, axis=1)
df_ult = pd.concat([df, ultimate_feats], axis=1).drop(['userId','movieId','timestamp'], axis=1)
X = df_ult.drop('rating', axis=1)
y = df_ult['rating']

# 6. Dividir y entrenar modelos
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest optimizado
rf = RandomForestRegressor(n_estimators=549, max_depth=5, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
pred_rf = rf.predict(X_test)
rmse_rf = sqrt(mean_squared_error(y_test, pred_rf))

# HistGradientBoosting
hgb = HistGradientBoostingRegressor(random_state=42)
hgb.fit(X_train, y_train)
pred_hgb = hgb.predict(X_test)
rmse_hgb = sqrt(mean_squared_error(y_test, pred_hgb))

print(f"RMSE RandomForest: {rmse_rf:.4f}")
print(f"RMSE HistGradientBoosting: {rmse_hgb:.4f}")


RMSE RandomForest: 1.1863
RMSE HistGradientBoosting: 1.3197


## XGBoost

In [111]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 1. Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# 2. Usuario-por-película y soporte global
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u) / total_users for m, u in users_by_movie.items()}

# 3. Generar ítems frecuentes de tamaño 2 y 3 (soporte ≥ 0.2)
min_support = 0.2
frequent_pairs = {
    frozenset([m1, m2]): len(users_by_movie[m1] & users_by_movie[m2]) / total_users
    for m1, m2 in combinations(users_by_movie.keys(), 2)
    if len(users_by_movie[m1] & users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1, m2, m3]): len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users
    for m1, m2, m3 in combinations(users_by_movie.keys(), 3)
    if len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users >= min_support
}

# 4. Función Apriori para pares y tríos
def apriori_features(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}
    pc = ps = tc = ts = 0
    for other in rated:
        pair = frozenset([target, other])
        if pair in frequent_pairs:
            pc += 1
            ps += frequent_pairs[pair]
    for combo in combinations(rated, 2):
        tri = frozenset([target] + list(combo))
        if tri in frequent_triples:
            tc += 1
            ts += frequent_triples[tri]
    return pd.Series({
        'sup_target': movie_support.get(target, 0.0),
        'cnt_rated': len(rated),
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    })

# 5. Aplicar Apriori
apriori_feats = df.apply(apriori_features, axis=1)
df_feat = pd.concat([df, apriori_feats], axis=1)

# 6. Matriz usuario×película y SVD
user_item = df.pivot(index='userId', columns='movieId', values='rating').fillna(0)
n_comp = min(20, user_item.shape[1] - 1, user_item.shape[0] - 1)
svd = TruncatedSVD(n_components=n_comp, random_state=42)
user_latent = svd.fit_transform(user_item)
movie_latent = svd.components_.T

user_latent_df = pd.DataFrame(user_latent, index=user_item.index,
                              columns=[f'u_lat_{i}' for i in range(n_comp)])
movie_latent_df = pd.DataFrame(movie_latent, index=user_item.columns,
                               columns=[f'm_lat_{i}' for i in range(n_comp)])

df_feat = df_feat.merge(user_latent_df, left_on='userId', right_index=True)
df_feat = df_feat.merge(movie_latent_df, left_on='movieId', right_index=True)

# 7. Preparar X, y
df_final = df_feat.drop(['userId', 'movieId', 'timestamp'], axis=1)
X = df_final.drop('rating', axis=1)
y = df_final['rating']

# 8. División
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 9. Entrenar XGBoost
xgb = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1,
                   objective='reg:squarederror', random_state=42, n_jobs=-1)
xgb.fit(X_train, y_train)
pred_xgb = xgb.predict(X_test)
rmse_xgb = sqrt(mean_squared_error(y_test, pred_xgb))

print(f"RMSE XGBoost: {rmse_xgb:.4f}")
len(X.columns)

RMSE XGBoost: 1.6048


24

In [112]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# Pre-cálculo de soportes
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u) / total_users for m, u in users_by_movie.items()}

# Soporte mínimo y generación de pares y tríos frecuentes
min_support = 0.2
frequent_pairs = {
    frozenset({m1, m2}): len(users_by_movie[m1] & users_by_movie[m2]) / total_users
    for m1, m2 in combinations(users_by_movie.keys(), 2)
    if len(users_by_movie[m1] & users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset({m1, m2, m3}):
    len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users
    for m1, m2, m3 in combinations(users_by_movie.keys(), 3)
    if len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users >= min_support
}

# Función de extracción de 9 features Apriori
def apriori_features(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}
    
    pair_count = pair_sum = 0
    confidences = []
    lifts = []
    triple_count = triple_sum = 0
    
    sup_target = movie_support.get(target, 0.0)
    
    for other in rated:
        pair = frozenset({target, other})
        if pair in frequent_pairs:
            sup = frequent_pairs[pair]
            pair_count += 1
            pair_sum += sup
            if sup_target > 0:
                confidences.append(sup / sup_target)
            o_sup = movie_support.get(other, 0)
            if sup_target > 0 and o_sup > 0:
                lifts.append(sup / (sup_target * o_sup))
    
    for combo in combinations(rated, 2):
        triple = frozenset({target, *combo})
        if triple in frequent_triples:
            triple_count += 1
            triple_sum += frequent_triples[triple]
    
    return pd.Series({
        'sup_target': sup_target,
        'cnt_rated': len(rated),
        'freq_pair_count': pair_count,
        'freq_pair_support_sum': pair_sum,
        'max_pair_confidence': max(confidences) if confidences else 0.0,
        'avg_pair_lift': sum(lifts)/len(lifts) if lifts else 0.0,
        'freq_triple_count': triple_count,
        'freq_triple_support_sum': triple_sum,
        'avg_triple_lift': sum(lifts)/len(lifts) if lifts else 0.0
    })

# Generar features Apriori
apriori_feats = df.apply(apriori_features, axis=1)
df_feat = pd.concat([df, apriori_feats], axis=1)

# Preparar X, y
df_final = df_feat.drop(['userId','movieId','timestamp'], axis=1)
X = df_final.drop('rating', axis=1)
y = df_final['rating']

# División train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Entrenar XGBRegressor
xgb = XGBRegressor(n_estimators=200, max_depth=6, learning_rate=0.1,
                   objective='reg:squarederror', random_state=42, n_jobs=-1)
xgb.fit(X_train, y_train)

# Predecir y evaluar
y_pred = xgb.predict(X_test)
rmse_xgb = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE XGBoost con 9 features Apriori: {rmse_xgb:.4f}")
len(X.columns)


RMSE XGBoost con 9 features Apriori: 1.6430


9

## FactorizationMachine

In [160]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader

# 1. Carga y features base (idéntico a tu código)
df = pd.read_csv('ratings2comoML.csv')
df['user_avg_rating'] = df.groupby('userId')['rating'].transform('mean')
df['movie_avg_rating'] = df.groupby('movieId')['rating'].transform('mean')
df['rating_diff'] = df['user_avg_rating'] - df['movie_avg_rating']

# 2. Apriori básico (idéntico)
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
min_support = 0.2

frequent_pairs = {}
for m1, m2 in combinations(users_by_movie, 2):
    inter = users_by_movie[m1] & users_by_movie[m2]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_pairs[frozenset({m1, m2})] = support

frequent_triples = {}
for m1, m2, m3 in combinations(users_by_movie, 3):
    inter = users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]
    support = len(inter) / total_users
    if support >= min_support:
        frequent_triples[frozenset({m1, m2, m3})] = support

def apriori_basic(row):
    user, target = row['userId'], row['movieId']
    rated = set(df[df['userId']==user]['movieId']) - {target}
    pc = ps = tc = ts = 0
    for other in rated:
        pair = frozenset({target, other})
        if pair in frequent_pairs:
            pc += 1
            ps += frequent_pairs[pair]
    for combo in combinations(rated, 2):
        tri = frozenset((target,)+combo)
        if tri in frequent_triples:
            tc += 1
            ts += frequent_triples[tri]
    return pd.Series({
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    })

apriori_feats = df.apply(apriori_basic, axis=1)
df_model = pd.concat([df, apriori_feats], axis=1)

# 3. Preparamos tensores para PyTorch
X = df_model[['user_avg_rating','movie_avg_rating','rating_diff',
              'freq_pair_count','freq_pair_support_sum',
              'freq_triple_count','freq_triple_support_sum']].values
y = df_model['rating'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Convertimos a tensores
X_train_t = torch.FloatTensor(X_train)
y_train_t = torch.FloatTensor(y_train)
X_test_t  = torch.FloatTensor(X_test)
y_test_t  = torch.FloatTensor(y_test)

train_loader = DataLoader(
    TensorDataset(X_train_t, y_train_t),
    batch_size=1024, shuffle=True
)

# 4. Definimos el modelo FM
class FactorizationMachine(nn.Module):
    def __init__(self, n_features, k):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)
        # Factores latentes V: [n_features × k]
        self.v = nn.Parameter(torch.randn(n_features, k) * 0.01)

    def forward(self, x):
        # parte lineal
        lin = self.linear(x)  # [batch,1]
        # interacción de segundo orden:
        # ( (xV)^2 - (x^2 V^2) ).sum(dim=1) * 0.5
        xv = x @ self.v              # [batch, k]
        xv2 = (x**2) @ (self.v**2)   # [batch, k]
        interactions = 0.5 * torch.sum(xv**2 - xv2, dim=1, keepdim=True)
        return lin + interactions

# 5. Entrenamiento
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = FactorizationMachine(n_features=X_train.shape[1], k=10).to(device)
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(100):
    model.train()
    total_loss = 0.0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        opt.zero_grad()
        preds = model(xb).squeeze()
        loss = loss_fn(preds, yb)
        loss.backward()
        opt.step()
        total_loss += loss.item() * xb.size(0)
    #print(f"Epoch {epoch+1:02d}  MSE train: {total_loss/len(train_loader.dataset):.4f}")

# 6. Evaluación
model.eval()
with torch.no_grad():
    preds_test = model(X_test_t.to(device)).squeeze().cpu().numpy()
rmse = sqrt(mean_squared_error(y_test, preds_test))
print(f"\nRMSE con FM PyTorch: {rmse:.4f}")
len(X_train[0])
    


RMSE con FM PyTorch: 1.5279


7

## RandomForestRegressor Geners

In [117]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# 1. Leer ratings y movies (con géneros)
df_ratings = pd.read_csv('ratings2comoML.csv')
df_movies  = pd.read_csv(r"C:\Users\juanj\Desktop\ml-32m\movies.csv")  # movieId,title,genres

# 2. Preprocesar géneros: one-hot encoding
#    Cada película puede tener múltiples géneros separados por '|'
genres_expanded = df_movies['genres'].str.get_dummies(sep='|')
df_genres = pd.concat([df_movies[['movieId']], genres_expanded], axis=1)

# 3. Calcular Apriori features (pares + tríos) — igual que tu código
total_users = df_ratings['userId'].nunique()
users_by_movie = df_ratings.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u)/total_users for m,u in users_by_movie.items()}
min_support = 0.2

# ítems frecuentes
frequent_pairs = {
    frozenset([m1,m2]): len(users_by_movie[m1]&users_by_movie[m2]) / total_users
    for m1,m2 in combinations(users_by_movie,2)
    if len(users_by_movie[m1]&users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1,m2,m3]):
        len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users
    for m1,m2,m3 in combinations(users_by_movie,3)
    if len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users >= min_support
}

def apriori_feats(row):
    user,target = row['userId'], row['movieId']
    rated = set(df_ratings[df_ratings['userId']==user]['movieId']) - {target}
    pc = ps = tc = ts = 0
    for other in rated:
        p = frozenset([target,other])
        if p in frequent_pairs:
            pc += 1
            ps += frequent_pairs[p]
    for combo in combinations(rated,2):
        t = frozenset([target,*combo])
        if t in frequent_triples:
            tc += 1
            ts += frequent_triples[t]
    return pd.Series({
        'sup_target': movie_support.get(target,0),
        'cnt_rated': len(rated),
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    })

# 4. Construir DF con features Apriori
apriori_features = df_ratings.apply(apriori_feats, axis=1)
df = pd.concat([df_ratings, apriori_features], axis=1)

# 5. Merge con géneros
df = df.merge(df_genres, on='movieId')

# 6. Preparar X e y
X = df.drop(['rating', 'userId', 'movieId', 'timestamp', 'title'] if 'title' in df.columns else ['rating','userId','movieId','timestamp'], axis=1)
y = df['rating']

# 7. Train/Test split y entrenamiento
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
rf = RandomForestRegressor(n_estimators=549, max_depth=5, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

# 8. Evaluación
rmse = sqrt(mean_squared_error(y_test, pred))
print(f"RMSE con géneros + Apriori: {rmse:.4f}")


RMSE con géneros + Apriori: 1.7374


## XGBoost 

In [119]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 1. Carga del dataset de ratings
df = pd.read_csv('ratings2comoML.csv')

# 2. Soporte global y sets de usuarios por película
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u)/total_users for m,u in users_by_movie.items()}

# 3. Generar ítems frecuentes de tamaño 2 y 3
min_support = 0.2
frequent_pairs = {
    frozenset([m1,m2]): len(users_by_movie[m1]&users_by_movie[m2])/total_users
    for m1,m2 in combinations(users_by_movie,2)
    if len(users_by_movie[m1]&users_by_movie[m2])/total_users >= min_support
}
frequent_triples = {
    frozenset([m1,m2,m3]):
    len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3])/total_users
    for m1,m2,m3 in combinations(users_by_movie,3)
    if len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3])/total_users >= min_support
}

# 4. Función Apriori features ultimate
def apriori_features_ultimate(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId']==user]['movieId']) - {target}
    
    sup_target = movie_support.get(target, 0.0)
    cnt_rated = len(rated)
    
    pair_supports = []
    pair_leverages = []
    confs = []
    lifts = []
    weighted_ratings = []
    triple_supports = []
    triple_leverages = []
    triple_lifts = []
    
    for other in rated:
        p = frozenset([target,other])
        if p in frequent_pairs:
            sup = frequent_pairs[p]
            pair_supports.append(sup)
            o_sup = movie_support.get(other, 0)
            pair_leverages.append(sup - sup_target*o_sup)
            if sup_target>0:
                confs.append(sup/sup_target)
            if sup_target>0 and o_sup>0:
                lifts.append(sup/(sup_target*o_sup))
            rating_other = df[(df['userId']==user)&(df['movieId']==other)]['rating'].iloc[0]
            weighted_ratings.append(sup*rating_other)
    for combo in combinations(rated,2):
        t = frozenset([target,*combo])
        if t in frequent_triples:
            sup3 = frequent_triples[t]
            triple_supports.append(sup3)
            o1,o2 = combo
            o1_sup = movie_support.get(o1,0)
            o2_sup = movie_support.get(o2,0)
            triple_leverages.append(sup3 - sup_target*o1_sup*o2_sup)
            if sup_target>0 and o1_sup>0 and o2_sup>0:
                triple_lifts.append(sup3/(sup_target*o1_sup*o2_sup))
    return pd.Series({
        'sup_target': sup_target,
        'cnt_rated': cnt_rated,
        'freq_pair_count': len(pair_supports),
        'freq_pair_support_sum': sum(pair_supports),
        'max_pair_support': max(pair_supports) if pair_supports else 0,
        'min_pair_support': min(pair_supports) if pair_supports else 0,
        'avg_pair_support': sum(pair_supports)/len(pair_supports) if pair_supports else 0,
        'sum_pair_leverage': sum(pair_leverages),
        'max_pair_leverage': max(pair_leverages) if pair_leverages else 0,
        'max_pair_confidence': max(confs) if confs else 0,
        'avg_pair_lift': sum(lifts)/len(lifts) if lifts else 0,
        'max_pair_lift': max(lifts) if lifts else 0,
        'weighted_avg_rating_pair': sum(weighted_ratings)/sum(pair_supports) if pair_supports else 0,
        'freq_triple_count': len(triple_supports),
        'freq_triple_support_sum': sum(triple_supports),
        'avg_triple_support': sum(triple_supports)/len(triple_supports) if triple_supports else 0,
        'max_triple_support': max(triple_supports) if triple_supports else 0,
        'sum_triple_leverage': sum(triple_leverages),
        'max_triple_lift': max(triple_lifts) if triple_lifts else 0,
        'avg_triple_lift': sum(triple_lifts)/len(triple_lifts) if triple_lifts else 0,
        'triple_coverage': len(triple_supports)/(cnt_rated*(cnt_rated-1)/2) if cnt_rated>1 else 0
    })

# 5. Aplicar y preparar dataset
feat_df = df.apply(apriori_features_ultimate, axis=1)
df_ml = pd.concat([df, feat_df], axis=1)
X = df_ml.drop(['userId','movieId','timestamp','rating'], axis=1)
y = df_ml['rating']

# 6. División
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 7. Entrenar XGBoost con parámetros ajustados
xgb = XGBRegressor(
    n_estimators=300,
    max_depth=6,
    learning_rate=0.05,
    subsample=0.9,
    colsample_bytree=0.8,
    objective='reg:squarederror',
    tree_method='hist',
    random_state=42,
    n_jobs=4
)
xgb.fit(X_train, y_train)

# 8. Evaluar RMSE
y_pred = xgb.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE XGBoost (Apriori ultimate): {rmse:.4f}")


RMSE XGBoost (Apriori ultimate): 1.1876


## XGBoost Geners

In [121]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor

# 1. Carga de datasets con bajo uso de memoria
df_ratings = pd.read_csv('ratings2comoML.csv')
df_movies = pd.read_csv('movies.csv', usecols=['movieId','genres'])

# 2. Preprocesar géneros manualmente
movie_genres = df_movies.set_index('movieId')['genres']\
                .apply(lambda s: s.split('|') if isinstance(s,str) else []).to_dict()
all_genres = sorted({g for genres in movie_genres.values() for g in genres})

# 3. Apriori: soporte global y frecuentes
total_users = df_ratings['userId'].nunique()
users_by_movie = df_ratings.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u)/total_users for m,u in users_by_movie.items()}
min_support = 0.2

# Pares/tríos frecuentes
frequent_pairs = {
    frozenset([m1,m2]): len(users_by_movie[m1]&users_by_movie[m2])/total_users
    for m1,m2 in combinations(users_by_movie,2)
    if len(users_by_movie[m1]&users_by_movie[m2])/total_users >= min_support
}
frequent_triples = {
    frozenset([m1,m2,m3]):
    len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3])/total_users
    for m1,m2,m3 in combinations(users_by_movie,3)
    if len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3])/total_users >= min_support
}

# 4. Función de features Apriori + géneros
def apriori_feats(row):
    user, target = row['userId'], row['movieId']
    rated = set(df_ratings[df_ratings['userId']==user]['movieId']) - {target}
    pc = ps = 0
    tc = ts = 0
    sup_target = movie_support.get(target,0.0)
    for other in rated:
        p = frozenset([target, other])
        if p in frequent_pairs:
            sup = frequent_pairs[p]
            pc += 1; ps += sup
    for combo in combinations(rated,2):
        t = frozenset([target,*combo])
        if t in frequent_triples:
            tc += 1; ts += frequent_triples[t]
    feats = {
        'sup_target': sup_target,
        'cnt_rated': len(rated),
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    }
    # añadir géneros one-hot
    genres = movie_genres.get(target, [])
    for g in all_genres:
        feats[f'genre_{g}'] = int(g in genres)
    return pd.Series(feats)

# 5. Generar features y merge
apriori_df = df_ratings.apply(apriori_feats, axis=1)
df_ml = pd.concat([df_ratings, apriori_df], axis=1)

# 6. Preparar X,e y
drop_cols = ['userId','movieId','timestamp']
X = df_ml.drop(drop_cols + ['rating'], axis=1)
y = df_ml['rating']

# 7. Train/Test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 8. Entrenar XGBoost
xgb = XGBRegressor(n_estimators=100, max_depth=6, learning_rate=0.05,
                   subsample=0.8, colsample_bytree=0.8,
                   objective='reg:squarederror', tree_method='hist',
                   random_state=42, n_jobs=4)
xgb.fit(X_train, y_train)

# 9. Evaluación
y_pred = xgb.predict(X_test)
rmse = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE XGBoost (Apriori + Géneros): {rmse:.4f}")


RMSE XGBoost (Apriori + Géneros): 1.9367


## Géneros + SVD + XGBoost

In [124]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.decomposition import TruncatedSVD
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

# 1. Carga de datos
df_ratings = pd.read_csv('ratings2comoML.csv')
df_movies  = pd.read_csv('movies.csv')   # columnas: movieId, title, genres

# 2. One-hot encoding de géneros
df_genres = df_movies['genres'].str.get_dummies(sep='|')
df_genres['movieId'] = df_movies['movieId']

# 3. Pre-cálculo de soportes para Apriori
total_users = df_ratings['userId'].nunique()
users_by_movie = df_ratings.groupby('movieId')['userId'].apply(set).to_dict()
movie_support   = {m: len(u)/total_users for m,u in users_by_movie.items()}
min_support     = 0.2

frequent_pairs = {
    frozenset([m1,m2]): len(users_by_movie[m1]&users_by_movie[m2]) / total_users
    for m1,m2 in combinations(users_by_movie,2)
    if len(users_by_movie[m1]&users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1,m2,m3]):
      len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users
    for m1,m2,m3 in combinations(users_by_movie,3)
    if len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users >= min_support
}

# 4. Función para extraer features Apriori (pares + tríos)
def apriori_features(row):
    u, target = row['userId'], row['movieId']
    seen = set(df_ratings[df_ratings['userId']==u]['movieId']) - {target}
    pc = ps = tc = ts = 0
    sup_t = movie_support.get(target, 0.0)
    for other in seen:
        p = frozenset([target, other])
        if p in frequent_pairs:
            pc += 1
            ps += frequent_pairs[p]
    for a,b in combinations(seen,2):
        t = frozenset([target,a,b])
        if t in frequent_triples:
            tc += 1
            ts += frequent_triples[t]
    return pd.Series({
        'sup_target': sup_t,
        'cnt_rated': len(seen),
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts
    })

# 5. Aplicar Apriori y merge con géneros
apr_feats = df_ratings.apply(apriori_features, axis=1)
df = pd.concat([df_ratings, apr_feats], axis=1)
df = df.merge(df_genres, on='movieId', how='left')

# 6. Construir matriz usuario×película y extraer factores latentes
ui = df_ratings.pivot(index='userId', columns='movieId', values='rating').fillna(0)
k = min(20, ui.shape[1]-1, ui.shape[0]-1)
svd = TruncatedSVD(n_components=k, random_state=42)
U = svd.fit_transform(ui)             # (n_users, k)
V = svd.components_.T                # (n_movies, k)

df_U = pd.DataFrame(U, index=ui.index, columns=[f'u_lat_{i}' for i in range(k)])
df_V = pd.DataFrame(V, index=ui.columns, columns=[f'm_lat_{i}' for i in range(k)])

df = df.merge(df_U, left_on='userId',  right_index=True, how='left')
df = df.merge(df_V, left_on='movieId', right_index=True, how='left')

# 7. Preparar X e y
drop_cols = ['userId','movieId','timestamp','rating','title']
X = df.drop([c for c in drop_cols if c in df.columns], axis=1)
y = df['rating']

# 8. División y entrenamiento con XGBoost
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = XGBRegressor(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='reg:squarederror',
    random_state=42,
    n_jobs=-1
)
model.fit(X_tr, y_tr)

# 9. Predicción y cálculo de RMSE
y_pred = model.predict(X_te)
rmse = sqrt(mean_squared_error(y_te, y_pred))
print(f"RMSE Apriori + Géneros + SVD + XGBoost: {rmse:.4f}")
len(X.columns)

RMSE Apriori + Géneros + SVD + XGBoost: 1.8158


44

## KNN

In [125]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# 1. Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# 2. Pre-cálculo de soportes y sets de usuarios por película
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u) / total_users for m, u in users_by_movie.items()}

# 3. Ítems frecuentes de tamaño 2 y 3 (soporte ≥ 0.2)
min_support = 0.2
frequent_pairs = {
    frozenset([m1, m2]): len(users_by_movie[m1] & users_by_movie[m2]) / total_users
    for m1, m2 in combinations(users_by_movie, 2)
    if len(users_by_movie[m1] & users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1, m2, m3]): len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users
    for m1, m2, m3 in combinations(users_by_movie, 3)
    if len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users >= min_support
}

# 4. Definición de la función de extracción de features Apriori
def apriori_features_ultimate(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}
    
    sup_target = movie_support.get(target, 0.0)
    rated_count = len(rated)
    
    pair_count = pair_sum = 0
    triple_count = triple_sum = 0
    
    for other in rated:
        pair = frozenset([target, other])
        if pair in frequent_pairs:
            sup = frequent_pairs[pair]
            pair_count += 1
            pair_sum += sup
    
    for combo in combinations(rated, 2):
        tri = frozenset([target, *combo])
        if tri in frequent_triples:
            sup3 = frequent_triples[tri]
            triple_count += 1
            triple_sum += sup3
    
    return pd.Series({
        'sup_target': sup_target,
        'cnt_rated': rated_count,
        'freq_pair_count': pair_count,
        'freq_pair_support_sum': pair_sum,
        'freq_triple_count': triple_count,
        'freq_triple_support_sum': triple_sum
    })

# 5. Aplicar extracción de features Apriori
apr_feats = df.apply(apriori_features_ultimate, axis=1)
df_ml = pd.concat([df, apr_feats], axis=1)

# 6. Preparar X e y
X = df_ml.drop(['userId', 'movieId', 'timestamp', 'rating'], axis=1)
y = df_ml['rating']

# 7. División train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 8. Pipeline: escalado + KNN
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(
        n_neighbors=10,
        weights='distance',
        n_jobs=-1
    ))
])

# 9. Entrenamiento y predicción
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# 10. Evaluación RMSE
rmse_knn = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE KNN: {rmse_knn:.4f}")


RMSE KNN: 1.9882


## KNN Geners

In [104]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# 1. Carga de dataset de ratings
df = pd.read_csv('ratings2comoML.csv')

# 2. Carga de movies.csv con géneros
df_movies = pd.read_csv('movies.csv', usecols=['movieId','genres'], engine='python')
movie_genres = df_movies.set_index('movieId')['genres'].str.split('|').to_dict()

# 3. Pre-cálculo de soporte global y sets de usuarios por película
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u) / total_users for m, u in users_by_movie.items()}

# 4. Generar ítems frecuentes de tamaño 2 y 3 (soporte ≥ 0.2)
min_support = 0.2
frequent_pairs = {
    frozenset([m1, m2]): len(users_by_movie[m1] & users_by_movie[m2]) / total_users
    for m1, m2 in combinations(users_by_movie, 2)
    if len(users_by_movie[m1] & users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1, m2, m3]):
    len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users
    for m1, m2, m3 in combinations(users_by_movie, 3)
    if len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users >= min_support
}

# 5. Función para extraer features Apriori + géneros
def extract_features(row):
    user, target = row['userId'], row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}
    
    sup_target = movie_support.get(target, 0.0)
    cnt_rated = len(rated)
    pair_count = pair_sum = 0
    triple_count = triple_sum = 0
    
    # Pares frecuentes
    for other in rated:
        p = frozenset([target, other])
        if p in frequent_pairs:
            s = frequent_pairs[p]
            pair_count += 1
            pair_sum += s
    
    # Tríos frecuentes
    for combo in combinations(rated, 2):
        t = frozenset([target, *combo])
        if t in frequent_triples:
            s3 = frequent_triples[t]
            triple_count += 1
            triple_sum += s3
    
    feats = {
        'sup_target': sup_target,
        'cnt_rated': cnt_rated,
        'freq_pair_count': pair_count,
        'freq_pair_support_sum': pair_sum,
        'freq_triple_count': triple_count,
        'freq_triple_support_sum': triple_sum
    }
    
    # Features de géneros
    genres = movie_genres.get(target, [])
    feats['num_genres'] = len(genres)
    feats['is_genre_Comedy'] = int('Comedy' in genres)
    feats['is_genre_Drama'] = int('Drama' in genres)
    feats['is_genre_Action'] = int('Action' in genres)
    
    return pd.Series(feats)

# 6. Aplicar extracción y preparar X, y
features = df.apply(extract_features, axis=1)
df_ml = pd.concat([df, features], axis=1)
X = df_ml.drop(['userId', 'movieId', 'timestamp', 'rating'], axis=1)
y = df_ml['rating']

# 7. División train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 8. Pipeline KNN: escalado + KNeighborsRegressor
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=10, weights='distance', n_jobs=1))
])

# 9. Entrenar y predecir
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# 10. Evaluación RMSE
rmse_knn = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE KNN (Apriori+Géneros, 10 vars): {rmse_knn:.4f}")


RMSE KNN (Apriori+Géneros, 10 vars): 2.0835


---
# NN

## NN LSTM

In [127]:
import torch
from torch import nn
from torch.utils.data import TensorDataset, DataLoader
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
import numpy as np

# 1. Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# 2. Usuario-por-película y soportes
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u) / total_users for m, u in users_by_movie.items()}

# 3. Generar ítems frecuentes de tamaño 2 y 3 (soporte ≥ 0.2)
min_support = 0.2
frequent_pairs = {frozenset({m1, m2}): len(users_by_movie[m1] & users_by_movie[m2]) / total_users
                  for m1, m2 in combinations(users_by_movie.keys(), 2)
                  if len(users_by_movie[m1] & users_by_movie[m2]) / total_users >= min_support}
frequent_triples = {frozenset({m1, m2, m3}):
                    len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users
                    for m1, m2, m3 in combinations(users_by_movie.keys(), 3)
                    if len(users_by_movie[m1] & users_by_movie[m2] & users_by_movie[m3]) / total_users >= min_support}

# 4. Extracción de features avanzadas
def apriori_features_ultimate(row):
    user = row['userId']
    target = row['movieId']
    rated = set(df[df['userId'] == user]['movieId']) - {target}
    
    # Unarios
    sup_target = movie_support.get(target, 0.0)
    cnt_rated = len(rated)
    
    # Inicializar
    pair_supports = []
    pair_leverages = []
    confs = []
    lifts = []
    weighted_ratings = []
    
    triple_supports = []
    triple_leverages = []
    triple_lifts = []
    
    for other in rated:
        pair = frozenset({target, other})
        if pair in frequent_pairs:
            sup = frequent_pairs[pair]
            pair_supports.append(sup)
            t_sup = sup_target
            o_sup = movie_support.get(other, 0)
            pair_leverages.append(sup - t_sup * o_sup)
            if t_sup > 0:
                confs.append(sup / t_sup)
            if t_sup > 0 and o_sup > 0:
                lifts.append(sup / (t_sup * o_sup))
            # Weighted by support
            rating_other = df[(df['userId'] == user) & (df['movieId'] == other)]['rating'].iloc[0]
            weighted_ratings.append(sup * rating_other)
    
    for combo in combinations(rated, 2):
        triple = frozenset((target,) + combo)
        if triple in frequent_triples:
            sup3 = frequent_triples[triple]
            triple_supports.append(sup3)
            t_sup = sup_target
            o1_sup = movie_support.get(combo[0], 0)
            o2_sup = movie_support.get(combo[1], 0)
            triple_leverages.append(sup3 - t_sup * o1_sup * o2_sup)
            if t_sup > 0 and o1_sup > 0 and o2_sup > 0:
                triple_lifts.append(sup3 / (t_sup * o1_sup * o2_sup))
    
    # Features cálculo
    return pd.Series({
        'sup_target': sup_target,
        'cnt_rated': cnt_rated,
        'freq_pair_count': len(pair_supports),
        'freq_pair_support_sum': sum(pair_supports),
        'max_pair_support': max(pair_supports) if pair_supports else 0.0,
        'min_pair_support': min(pair_supports) if pair_supports else 0.0,
        'avg_pair_support': sum(pair_supports)/len(pair_supports) if pair_supports else 0.0,
        'sum_pair_leverage': sum(pair_leverages),
        'max_pair_leverage': max(pair_leverages) if pair_leverages else 0.0,
        'max_pair_confidence': max(confs) if confs else 0.0,
        'avg_pair_lift': sum(lifts)/len(lifts) if lifts else 0.0,
        'max_pair_lift': max(lifts) if lifts else 0.0,
        'weighted_avg_rating_pair': sum(weighted_ratings)/sum(pair_supports) if pair_supports else 0.0,
        'freq_triple_count': len(triple_supports),
        'freq_triple_support_sum': sum(triple_supports),
        'avg_triple_support': sum(triple_supports)/len(triple_supports) if triple_supports else 0.0,
        'max_triple_support': max(triple_supports) if triple_supports else 0.0,
        'sum_triple_leverage': sum(triple_leverages),
        'max_triple_lift': max(triple_lifts) if triple_lifts else 0.0,
        'avg_triple_lift': sum(triple_lifts)/len(triple_lifts) if triple_lifts else 0.0,
        'triple_coverage': len(triple_supports)/ (cnt_rated*(cnt_rated-1)/2) if cnt_rated > 1 else 0.0
    })

# 5. Generar dataset final
ultimate_feats = df.apply(apriori_features_ultimate, axis=1)
df_ult = pd.concat([df, ultimate_feats], axis=1).drop(['userId','movieId','timestamp'], axis=1)
X = df_ult.drop('rating', axis=1)
y = df_ult['rating']

# 6. Dividir y entrenar modelos
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# 1) Convertir DataFrames a numpy float32
X_train_np = X_train.to_numpy(dtype=np.float32)
y_train_np = y_train.to_numpy(dtype=np.float32)
X_test_np  = X_test.to_numpy(dtype=np.float32)
y_test_np  = y_test.to_numpy(dtype=np.float32)

# 2) Crear tensores y añadir dimensión de “feature”
#    De (N, seq_len) a (N, seq_len, 1)
X_train_t = torch.from_numpy(X_train_np).unsqueeze(2)
y_train_t = torch.from_numpy(y_train_np)
X_test_t  = torch.from_numpy(X_test_np).unsqueeze(2)
y_test_t  = torch.from_numpy(y_test_np)

# 3) DataLoader
batch_size = 512
train_ds   = TensorDataset(X_train_t, y_train_t)
train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

# 4) Definir el modelo RNN (LSTM + lineal)
class RNNRegressor(nn.Module):
    def __init__(self, seq_len, hidden_size=64, n_layers=1):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=1,
            hidden_size=hidden_size,
            num_layers=n_layers,
            batch_first=True,
        )
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, 1)
        out, _ = self.lstm(x)               # (batch, seq_len, hidden_size)
        last = out[:, -1, :]                # tomar salida del último paso
        return self.fc(last).squeeze(1)     # (batch,)

# 5) Preparar dispositivo y modelo
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
seq_len = X_train_t.shape[1]
model   = RNNRegressor(seq_len=seq_len, hidden_size=64).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# 6) Entrenamiento
n_epochs = 30
for epoch in range(1, n_epochs+1):
    model.train()
    running_loss = 0.0
    for xb, yb in train_loader:
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        preds = model(xb)
        loss = criterion(preds, yb)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * xb.size(0)
    mse_train = running_loss / len(train_loader.dataset)
    print(f"Epoch {epoch:02d} — MSE train: {mse_train:.4f}")

# 7) Evaluación en test
model.eval()
with torch.no_grad():
    preds_test = model(X_test_t.to(device)).cpu().numpy()
rmse_rnn = sqrt(mean_squared_error(y_test_np, preds_test))
print(f"\nRMSE RNN (LSTM): {rmse_rnn:.4f}")
len(X.columns)

Epoch 01 — MSE train: 17.3046
Epoch 02 — MSE train: 17.0984
Epoch 03 — MSE train: 16.8921
Epoch 04 — MSE train: 16.6835
Epoch 05 — MSE train: 16.4706
Epoch 06 — MSE train: 16.2515
Epoch 07 — MSE train: 16.0243
Epoch 08 — MSE train: 15.7869
Epoch 09 — MSE train: 15.5369
Epoch 10 — MSE train: 15.2715
Epoch 11 — MSE train: 14.9870
Epoch 12 — MSE train: 14.6791
Epoch 13 — MSE train: 14.3425
Epoch 14 — MSE train: 13.9706
Epoch 15 — MSE train: 13.5556
Epoch 16 — MSE train: 13.0880
Epoch 17 — MSE train: 12.5573
Epoch 18 — MSE train: 11.9518
Epoch 19 — MSE train: 11.2600
Epoch 20 — MSE train: 10.4727
Epoch 21 — MSE train: 9.5874
Epoch 22 — MSE train: 8.6155
Epoch 23 — MSE train: 7.5891
Epoch 24 — MSE train: 6.5652
Epoch 25 — MSE train: 5.6150
Epoch 26 — MSE train: 4.8019
Epoch 27 — MSE train: 4.1619
Epoch 28 — MSE train: 3.6939
Epoch 29 — MSE train: 3.3692
Epoch 30 — MSE train: 3.1540

RMSE RNN (LSTM): 1.7910


21

## NN MLP

In [106]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error

# 1. Carga del dataset
df = pd.read_csv('ratings2comoML.csv')

# 2. Usuario-por-película y soportes
total_users = df['userId'].nunique()
users_by_movie = df.groupby('movieId')['userId'].apply(set).to_dict()
movie_support   = {m: len(u)/total_users for m,u in users_by_movie.items()}

# 3. Ítems frecuentes (pares y tríos)
min_support = 0.2
frequent_pairs = {
    frozenset([m1,m2]): len(users_by_movie[m1]&users_by_movie[m2]) / total_users
    for m1,m2 in combinations(users_by_movie,2)
    if len(users_by_movie[m1]&users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1,m2,m3]):
      len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users
    for m1,m2,m3 in combinations(users_by_movie,3)
    if len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users >= min_support
}

# 4. Función de extracción de las 21 features Apriori avanzadas
def apriori_features_ultimate(row):
    user   = row['userId']
    target = row['movieId']
    rated  = set(df[df['userId']==user]['movieId']) - {target}

    sup_t = movie_support.get(target, 0.0)
    pc = ps = tc = ts = 0

    for other in rated:
        p = frozenset([target, other])
        if p in frequent_pairs:
            s = frequent_pairs[p]
            pc += 1; ps += s

    for combo in combinations(rated,2):
        t = frozenset([target,*combo])
        if t in frequent_triples:
            s3 = frequent_triples[t]
            tc += 1; ts += s3

    return pd.Series({
        'sup_target': sup_t,
        'cnt_rated': len(rated),
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts,
        'max_pair_support': max((frequent_pairs[frozenset([target,o])] 
                                 for o in rated if frozenset([target,o]) in frequent_pairs), default=0.0),
        'avg_pair_support': (ps/pc) if pc>0 else 0.0,
        'sum_pair_leverage': sum((frequent_pairs[frozenset([target,o])] -
                                  sup_t * movie_support[o])
                                 for o in rated if frozenset([target,o]) in frequent_pairs),
        'max_pair_confidence': max(((frequent_pairs[frozenset([target,o])] / sup_t)
                                    for o in rated if sup_t>0 and frozenset([target,o]) in frequent_pairs),
                                   default=0.0),
        'avg_pair_lift': sum((frequent_pairs[frozenset([target,o])] /
                              (sup_t*movie_support[o]))
                             for o in rated if sup_t>0 and movie_support[o]>0 and frozenset([target,o]) in frequent_pairs)
                          / max(pc,1),
        'freq_triple_support_sum': ts,
        'triple_coverage': tc / max((len(rated)*(len(rated)-1)/2),1)
    })

# 5. Generar dataset de features
ult_feats = df.apply(apriori_features_ultimate, axis=1)
df_ml = pd.concat([df, ult_feats], axis=1).drop(['userId','movieId','timestamp'], axis=1)

X = df_ml.drop('rating', axis=1)
y = df_ml['rating']

# 6. División train/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 7. Pipeline: escalado + MLPRegressor
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp',    MLPRegressor(
                   hidden_layer_sizes=(100,),
                   activation='relu',
                   solver='adam',
                   max_iter=200,
                   random_state=42
               ))
])

# 8. Entrenamiento y evaluación
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
rmse_nn = sqrt(mean_squared_error(y_test, y_pred))

print(f"RMSE Neural Network: {rmse_nn:.4f}")


RMSE Neural Network: 1.6304


## RNN

In [132]:
import pandas as pd
from itertools import combinations
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import SimpleRNN, Dense

# 1. Carga de datos
df_ratings = pd.read_csv('ratings2comoML.csv')
df_movies  = pd.read_csv('movies.csv', usecols=['movieId','genres'])
movie_genres = df_movies.set_index('movieId')['genres'].str.split('|').to_dict()

# 2. Pre-cálculo de soportes
total_users = df_ratings['userId'].nunique()
users_by_movie = df_ratings.groupby('movieId')['userId'].apply(set).to_dict()
movie_support = {m: len(u)/total_users for m,u in users_by_movie.items()}

# 3. Ítems frecuentes (pares y tríos)
min_support = 0.2
frequent_pairs = {
    frozenset([m1,m2]): len(users_by_movie[m1]&users_by_movie[m2]) / total_users
    for m1,m2 in combinations(users_by_movie,2)
    if len(users_by_movie[m1]&users_by_movie[m2]) / total_users >= min_support
}
frequent_triples = {
    frozenset([m1,m2,m3]):
      len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users
    for m1,m2,m3 in combinations(users_by_movie,3)
    if len(users_by_movie[m1]&users_by_movie[m2]&users_by_movie[m3]) / total_users >= min_support
}

# 4. Extracción de las 10 features
def extract_features(row):
    user, target = row['userId'], row['movieId']
    rated = set(df_ratings[df_ratings['userId']==user]['movieId']) - {target}
    sup_t = movie_support.get(target,0.0)
    cnt_r = len(rated)
    pc = ps = tc = ts = 0
    for other in rated:
        p = frozenset([target,other])
        if p in frequent_pairs:
            s = frequent_pairs[p]
            pc += 1; ps += s
    for combo in combinations(rated,2):
        t = frozenset([target,*combo])
        if t in frequent_triples:
            s3 = frequent_triples[t]
            tc += 1; ts += s3
    feats = {
        'sup_target': sup_t,
        'cnt_rated': cnt_r,
        'freq_pair_count': pc,
        'freq_pair_support_sum': ps,
        'freq_triple_count': tc,
        'freq_triple_support_sum': ts,
    }
    genres = movie_genres.get(target, [])
    feats['num_genres']   = len(genres)
    feats['is_Comedy']    = int('Comedy' in genres)
    feats['is_Drama']     = int('Drama' in genres)
    feats['is_Thriller']  = int('Thriller' in genres)
    return pd.Series(feats)

feature_df = df_ratings.apply(extract_features, axis=1)
X = feature_df.values
y = df_ratings['rating'].values

# 5. Escalado y reshape para RNN (timesteps=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_seq = X_scaled.reshape((X_scaled.shape[0], 1, X_scaled.shape[1]))

# 6. Train/test
X_train, X_test, y_train, y_test = train_test_split(
    X_seq, y, test_size=0.2, random_state=42
)

# 7. Definir y entrenar la RNN
n_features = X_seq.shape[2]
model = Sequential([
    SimpleRNN(64, activation='relu', input_shape=(1, n_features)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')

# Ajusta epochs/batch_size según tu máquina
model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=1)

# 8. Evaluación
y_pred = model.predict(X_test).flatten()
rmse_rnn = sqrt(mean_squared_error(y_test, y_pred))
print(f"RMSE RNN: {rmse_rnn:.4f}")

# 9. Número de features
print(f"Numero de features: {X.shape[1]}")

Epoch 1/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 14ms/step - loss: 18.2143 
Epoch 2/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step - loss: 17.7385
Epoch 3/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - loss: 16.8936
Epoch 4/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - loss: 16.8628
Epoch 5/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step - loss: 16.6916
Epoch 6/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step - loss: 15.8470
Epoch 7/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - loss: 16.2378
Epoch 8/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 15ms/step - loss: 15.7300
Epoch 9/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step - loss: 14.6093
Epoch 10/10
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 13ms/step - loss: 14.6057
[1m1/1