# Testing Recommender Functions

This notebook installs the required dependencies and tests the recommender functions defined in the project modules `utils.py` and `recommender.py`. It loads the models with `get_models()` and then exercises the ensemble methods: bagging, boosting, stacking, and a hybrid of these.
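
As a rough illustration of what the bagging and boosting ensembles do with per-model scores, here is a minimal NumPy sketch. The implementations in `recommender.py` are not shown in this notebook, so the function names `average_scores` and `error_weighted_scores`, the inverse-error weighting, and the toy scores are all assumptions for illustration, not the project's actual code.

```python
import numpy as np

def average_scores(scores):
    """Bagging-style combination: unweighted mean of per-model scores."""
    return float(np.mean(scores))

def error_weighted_scores(scores, errors):
    """Boosting-style combination: models with lower error get a higher weight.

    Inverse-error weighting is an assumption here; it is one common choice,
    not necessarily what ensemble_boosting in recommender.py does.
    """
    weights = 1.0 / (np.asarray(errors, dtype=float) + 1e-8)
    weights /= weights.sum()  # normalize so the weights sum to 1
    return float(np.dot(weights, scores))

# Toy click-probability scores from three hypothetical cluster models.
scores = np.array([0.80, 0.60, 0.70])
errors = np.array([0.20, 0.15, 0.25])

bagged = average_scores(scores)
boosted = error_weighted_scores(scores, errors)
print(f"bagged={bagged:.3f} boosted={boosted:.3f}")
```

In this toy case the model with the lowest error also gives the lowest score, so the error-weighted result is pulled below the plain average.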

In [1]:
# Install required dependencies
#!pip install --upgrade pip
#!pip install numpy scikit-learn tensorflow keras fastapi torch transformers

# If your project has a requirements.txt file, you can also use:
# !pip install --no-cache-dir -r requirements.txt

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from utils import get_models, run_experiments_modular

models_dict, news_df, behaviors_df, tokenizer = get_models()

# Fit a TF-IDF vectorizer on the combined news text.
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(news_df["CombinedText"])

results_df = run_experiments_modular(
    behaviors_df, news_df, models_dict, tokenizer,
    tfidf_vectorizer=tfidf_vectorizer,
    max_candidates=-1, test_user_count=1, n_dates=3,
    timeframe_hours=24*10, k=100, ensemble_method="bagging",
    min_tfidf_similarity=0.02,
)

2025-03-23 23:57:17.127063: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2025-03-23 23:57:17.162306: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1742774237.187198   64037 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1742774237.194479   64037 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2025-03-23 23:57:17.234173: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instr

Loaded news data:
   NewsID   Category               SubCategory  \
0  N88753  lifestyle           lifestyleroyals   
1  N45436       news  newsscienceandtechnology   
2  N23144     health                weightloss   
3  N86255     health                   medical   
4  N93187       news                 newsworld   

                                               Title  \
0  The Brands Queen Elizabeth, Prince Charles, an...   
1    Walmart Slashes Prices on Last-Generation iPads   
2                      50 Worst Habits For Belly Fat   
3  Dispose of unwanted prescription drugs during ...   
4  The Cost of Trump's Aid Freeze in the Trenches...   

                                            Abstract  \
0  Shop the notebooks, jackets, and more that the...   
1  Apple's new iPad releases bring big deals on l...   
2  These seemingly harmless habits are holding yo...   
3                                                NaN   
4  Lt. Ivan Molchanets peeked over a parapet of s...   

       

2025-03-23 23:58:28.708046: E external/local_xla/xla/stream_executor/cuda/cuda_driver.cc:152] failed call to cuInit: INTERNAL: CUDA error: Failed call to cuInit: UNKNOWN ERROR (303)
  saveable.load_own_variables(weights_store.get(inner_path))



Loading model for Cluster 1 from fastformer_cluster_1_full_balanced_1_epoch.keras
.cache
.ipynb_checkpoints
.Trash-0
backend copy 2.py
backend copy.py
backend-flask-unused.py
backend.py
data
dataset
detailed_log.log
Dockerfile
downloads
experiment_results.csv
fastapi copy.py
fastapi2.py
fastformer.json
fastformer_clusters.ipynb
fastformer_cluster_0_full_balanced_1_epoch.h5
fastformer_cluster_0_full_balanced_1_epoch.json
fastformer_cluster_0_full_balanced_1_epoch.keras
fastformer_cluster_0_full_balanced_1_epoch.weights.h5
fastformer_cluster_1_full_balanced_1_epoch.keras
fastformer_cluster_2_full_balanced_1_epoch.keras
fastformer_model.py
gdrive.py
models
models.py
my_log.py
recommender.py
requirements.txt
results
results.csv
stage1_candidates.csv
test_recommender.ipynb
test_recommender.py
tfidf_clicked_articles.csv
tokenizer.pkl
upload_to_hf.py
user_category_profiles.pkl
utils copy.py
utils.py
__pycache__
2.18.0
3.8.0

Loading model for Cluster 2 from fastformer_cluster_2_full_balanced

  cutoff_time = np.datetime64(current_date - timedelta(hours=candidate_timeframe_hours))
  ref_dt64 = np.datetime64(current_date)


recent_behaviors:        ImpressionID   UserID       Time  \
472830        472831  U403465 2019-11-09   

                                              HistoryText  \
472830  N59850 N104930 N68866 N82374 N123325 N127916 N...   

                                              Impressions  \
472830  N92613-0 N17456-0 N67369-0 N31486-0 N76810-0 N...   

                                        HistoryCategories        Date  
472830  [news, news, tv, news, tv, news, news, news, n...  2019-11-09  
Cutoff date (current_date - 24 hours): 2019-11-08T00:00:00.000000
history_text:
history_ids:[]
TF-IDF similarity distribution plotted and saved as 'tfidf_similarity_distribution.png'.
For user:U142719, row13833:N33037-0 N60032-0 N96373-0 N119637-0 N25037-0 N22659-0 N27251-0 N47257-0 N21819-0 N100900-0 N101119-0 N15137-0 N99484-0 N22597-0 N44558-0 N74438-0 N107637-0 N53225-0 N65552-0 N19455-0 N75391-1 N39784-0 N79094-0 N94356-0 N93674-0 N92699-0 N84137-0 N21883-0 N11682-0 N88148-0 N3075-0 N8719-0 N93

  ref_dt64 = np.datetime64(after_time)
  ref_dt64 = np.datetime64(after_time)


For user:U142719, row13833:N33037-0 N60032-0 N96373-0 N119637-0 N25037-0 N22659-0 N27251-0 N47257-0 N21819-0 N100900-0 N101119-0 N15137-0 N99484-0 N22597-0 N44558-0 N74438-0 N107637-0 N53225-0 N65552-0 N19455-0 N75391-1 N39784-0 N79094-0 N94356-0 N93674-0 N92699-0 N84137-0 N21883-0 N11682-0 N88148-0 N3075-0 N8719-0 N93643-0 N56390-0 N98478-0 N9720-0 N71368-0 N99498-0 N106539-0 N122726-0 N60829-0 N24798-0 N12586-0 N114546-0 N39197-0 N83421-0 N85266-0 N76489-0 N90822-0 N58417-0 N81064-0 N28925-0 N93411-0 N67068-0 N17575-0 N70822-0 N123077-0 N55792-0 N95135-0 N77310-0 N1928-0 N91757-0 N115772-0 N100343-0 N102394-0 N58258-0 N79081-0 N12453-0 N122819-0 N116064-0 N85986-0 N11296-0 N91892-0
For user:U142719, row666425:N67937-0 N84720-0 N65446-0 N51569-0 N65017-0 N35236-0 N100456-0 N110627-0 N106659-0 N104161-0 N17475-0 N1923-0 N42649-0 N880-0 N46223-0 N97912-0 N80784-0 N16321-0 N20871-0 N102499-0 N90184-0 N56873-0 N34172-0 N92300-0 N84430-0 N51163-0 N92199-0 N76665-0 N29544-0 N16384-0 N105407

  cutoff_time = np.datetime64(current_date - timedelta(hours=candidate_timeframe_hours))
  ref_dt64 = np.datetime64(current_date)


recent_behaviors:         ImpressionID   UserID                Time  \
0                   1   U87243 2019-11-10 11:30:54   
9                  10  U290933 2019-11-10 11:54:34   
45                 46  U595088 2019-11-10 18:26:21   
56                 57  U184159 2019-11-10 09:57:00   
88                 89  U503349 2019-11-10 07:32:53   
...               ...      ...                 ...   
2232692       2232693  U248404 2019-11-10 16:46:31   
2232695       2232696  U670414 2019-11-10 19:23:33   
2232707       2232708  U557401 2019-11-10 10:47:01   
2232730       2232731  U132053 2019-11-10 17:47:08   
2232736       2232737  U215047 2019-11-10 06:16:42   

                                               HistoryText  \
0        N8668 N39081 N65259 N79529 N73408 N43615 N2937...   
9        N14678 N71340 N65259 N92085 N31043 N70385 N123...   
45                     N60457 N100033 N84031 N97971 N19175   
56                                   N126052 N54360 N51166   
88       N35249 N20138 N

  ref_dt64 = np.datetime64(after_time)
  ref_dt64 = np.datetime64(after_time)


For user:U142719, row13833:N33037-0 N60032-0 N96373-0 N119637-0 N25037-0 N22659-0 N27251-0 N47257-0 N21819-0 N100900-0 N101119-0 N15137-0 N99484-0 N22597-0 N44558-0 N74438-0 N107637-0 N53225-0 N65552-0 N19455-0 N75391-1 N39784-0 N79094-0 N94356-0 N93674-0 N92699-0 N84137-0 N21883-0 N11682-0 N88148-0 N3075-0 N8719-0 N93643-0 N56390-0 N98478-0 N9720-0 N71368-0 N99498-0 N106539-0 N122726-0 N60829-0 N24798-0 N12586-0 N114546-0 N39197-0 N83421-0 N85266-0 N76489-0 N90822-0 N58417-0 N81064-0 N28925-0 N93411-0 N67068-0 N17575-0 N70822-0 N123077-0 N55792-0 N95135-0 N77310-0 N1928-0 N91757-0 N115772-0 N100343-0 N102394-0 N58258-0 N79081-0 N12453-0 N122819-0 N116064-0 N85986-0 N11296-0 N91892-0
For user:U142719, row666425:N67937-0 N84720-0 N65446-0 N51569-0 N65017-0 N35236-0 N100456-0 N110627-0 N106659-0 N104161-0 N17475-0 N1923-0 N42649-0 N880-0 N46223-0 N97912-0 N80784-0 N16321-0 N20871-0 N102499-0 N90184-0 N56873-0 N34172-0 N92300-0 N84430-0 N51163-0 N92199-0 N76665-0 N29544-0 N16384-0 N105407

  cutoff_time = np.datetime64(current_date - timedelta(hours=candidate_timeframe_hours))
  ref_dt64 = np.datetime64(current_date)


history_text:
history_ids:[]
TF-IDF similarity distribution plotted and saved as 'tfidf_similarity_distribution.png'.
For user:U142719, row666425:N67937-0 N84720-0 N65446-0 N51569-0 N65017-0 N35236-0 N100456-0 N110627-0 N106659-0 N104161-0 N17475-0 N1923-0 N42649-0 N880-0 N46223-0 N97912-0 N80784-0 N16321-0 N20871-0 N102499-0 N90184-0 N56873-0 N34172-0 N92300-0 N84430-0 N51163-0 N92199-0 N76665-0 N29544-0 N16384-0 N105407-0 N102458-0 N91865-0 N50175-0 N114449-1 N29441-0 N57001-0 N97460-1 N25814-1 N123968-0 N76209-0 N120147-0 N57497-0 N74682-0
For user:U142719, row749027:N32419-0 N123683-0 N44077-0 N130076-0 N77503-0 N79817-1 N79480-0 N99184-0 N16161-0 N89112-0 N107472-0 N19635-0 N83707-0 N29544-0 N50175-0 N17475-0 N48841-0 N60268-0 N94108-0 N100425-0 N34424-0 N65446-0 N46716-0 N114057-0 N74401-0 N99964-0 N33901-0 N19834-0 N117698-0 N40795-0 N102304-0 N33539-0 N89701-0 N48205-0 N4360-0 N59288-0 N84706-0 N73295-0 N95341-0 N87236-0 N49048-0 N103810-0 N102426-0 N4371-0 N103133-0 N118623-0 

  ref_dt64 = np.datetime64(after_time)
  ref_dt64 = np.datetime64(after_time)


For user:U142719, row666425:N67937-0 N84720-0 N65446-0 N51569-0 N65017-0 N35236-0 N100456-0 N110627-0 N106659-0 N104161-0 N17475-0 N1923-0 N42649-0 N880-0 N46223-0 N97912-0 N80784-0 N16321-0 N20871-0 N102499-0 N90184-0 N56873-0 N34172-0 N92300-0 N84430-0 N51163-0 N92199-0 N76665-0 N29544-0 N16384-0 N105407-0 N102458-0 N91865-0 N50175-0 N114449-1 N29441-0 N57001-0 N97460-1 N25814-1 N123968-0 N76209-0 N120147-0 N57497-0 N74682-0
For user:U142719, row749027:N32419-0 N123683-0 N44077-0 N130076-0 N77503-0 N79817-1 N79480-0 N99184-0 N16161-0 N89112-0 N107472-0 N19635-0 N83707-0 N29544-0 N50175-0 N17475-0 N48841-0 N60268-0 N94108-0 N100425-0 N34424-0 N65446-0 N46716-0 N114057-0 N74401-0 N99964-0 N33901-0 N19834-0 N117698-0 N40795-0 N102304-0 N33539-0 N89701-0 N48205-0 N4360-0 N59288-0 N84706-0 N73295-0 N95341-0 N87236-0 N49048-0 N103810-0 N102426-0 N4371-0 N103133-0 N118623-0 N880-0 N80770-0 N4384-0 N28930-0 N20250-0 N64238-0 N1923-0 N102845-0 N30955-0 N97743-0 N33378-0
For user:U142719, row7

In [3]:
# Deliberate stop: halts "Run All" before the experimental cells below.
raise RuntimeError("Intentional stop: the cells below are experimental.")

In [None]:



from utils import get_models, run_experiments

models_dict, news_df, behaviors_df, tokenizer = get_models()
# Inspect the impressions logged for a single user.
user_data = behaviors_df[behaviors_df['UserID'] == "U379278"]
print(user_data[['Time', 'Impressions']])

# Signature: run_experiments(behaviors_df, models_dict, max_candidates=-1, test_user_count=1, n_dates=3, timeframe_hours=1, k=5, filter_method="both", min_tfidf_similarity=0.02, tfidf_vectorizer=None)
run_experiments(
    behaviors_df, models_dict,
    max_candidates=-1, test_user_count=1, n_dates=5, timeframe_hours=24*10, k=20,
    filter_method="both", min_tfidf_similarity=0.02,
)


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# NOTE: candidate_similarities, ground_truth_clicks, and compute_metric are not
# defined in this notebook; define them before running this cell.
thresholds = np.linspace(0.0, 1.0, 21)
performance = []

for thr in thresholds:
    # Keep only the similarity scores at or above the current threshold.
    filtered_candidates = [score for score in candidate_similarities if score >= thr]
    # Compute the chosen metric (e.g., F1 or Precision@K) on the filtered candidates.
    metric_value = compute_metric(filtered_candidates, ground_truth_clicks)
    performance.append(metric_value)

plt.plot(thresholds, performance, marker='o')
plt.xlabel("TF-IDF Similarity Threshold")
plt.ylabel("Metric (e.g., F1-score)")
plt.title("Threshold Tuning for Candidate Filtering")
plt.show()
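
The threshold sweep above relies on `compute_metric`, `candidate_similarities`, and `ground_truth_clicks`, none of which are defined in this notebook. The sketch below shows one way such a metric could be computed while keeping each similarity score aligned with its click label. The helper `precision_recall_f1` and the toy data are hypothetical and only illustrate the idea; they are not taken from `utils.py` or `recommender.py`.

```python
import numpy as np

def precision_recall_f1(similarities, clicked, threshold):
    """Keep candidates with similarity >= threshold and score them against clicks."""
    similarities = np.asarray(similarities, dtype=float)
    clicked = np.asarray(clicked, dtype=bool)
    kept = similarities >= threshold
    true_pos = int(np.sum(kept & clicked))
    precision = true_pos / kept.sum() if kept.any() else 0.0
    recall = true_pos / clicked.sum() if clicked.any() else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# Toy data (made up for illustration): one similarity and one click label per candidate.
toy_similarities = [0.01, 0.03, 0.12, 0.42, 0.07]
toy_clicked = [0, 0, 1, 1, 0]

thresholds = np.linspace(0.0, 1.0, 21)
f1_scores = [precision_recall_f1(toy_similarities, toy_clicked, t)[2] for t in thresholds]
best_threshold = float(thresholds[int(np.argmax(f1_scores))])
print("best threshold:", best_threshold)
```

Passing the scores and labels as parallel arrays avoids the misalignment that can occur when only the surviving scores are handed to the metric.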

In [None]:
# Import required modules
import numpy as np
from utils import get_models  # Ensure these are in your PYTHONPATH
from recommender import ensemble_bagging, ensemble_boosting, train_stacking_meta_model, ensemble_stacking, hybrid_ensemble, tokenize_input

# get_models() returns (models_dict, news_df, behaviors_df, tokenizer); models_dict holds one model per cluster (0, 1, 2, ...).
print("Modules imported successfully.")

In [None]:
import os
import pandas as pd
import numpy as np
import pickle
from tensorflow.keras.preprocessing.sequence import pad_sequences
import sys
import tensorflow as tf

# Remap standalone keras modules to tensorflow.keras
sys.modules["keras.preprocessing.text"] = tf.keras.preprocessing.text
sys.modules["keras.preprocessing.sequence"] = tf.keras.preprocessing.sequence
sys.modules["keras.utils"] = tf.keras.utils

news_file = "news.tsv"
behaviors_file = "behaviors.tsv"
data_dir = 'dataset/train/'  # Adjust path as necessary
valid_data_dir = 'dataset/valid/'  # Adjust path as necessary
news_path = os.path.join(data_dir, news_file)
behaviors_path = os.path.join(data_dir, behaviors_file)

# Set maximum lengths (should match your model settings)
max_history_length = 50
max_title_length = 30

# Load the pre-saved tokenizer (assumes you already created and saved it)
with open('tokenizer.pkl', 'rb') as f:
    tokenizer = pickle.load(f)

# Load MIND data from data_dir (the training split by default; adjust file paths as necessary)
# news.tsv is assumed to contain: NewsID, Category, SubCategory, Title, Abstract, URL, TitleEntities, AbstractEntities
news_df = pd.read_csv(news_path, sep='\t', 
                      names=['NewsID', 'Category', 'SubCategory', 'Title', 'Abstract', 'URL', 'TitleEntities', 'AbstractEntities'])
# behaviors.tsv is assumed to contain: ImpressionID, UserID, Time, HistoryText, Impressions
behaviors_df = pd.read_csv(behaviors_path, sep='\t', 
                           names=['ImpressionID', 'UserID', 'Time', 'HistoryText', 'Impressions'])

# Create a dictionary mapping NewsID to Title (or CombinedText if available)
news_dict = dict(zip(news_df['NewsID'], news_df['Title']))

# Select one sample from the test behaviors
sample = behaviors_df.iloc[0]

# Process history: split the HistoryText (a space-separated string of NewsIDs)
history_text = sample['HistoryText']
history_ids = history_text.split() if pd.notna(history_text) else []

# Retrieve the title for each news ID in the history (default to empty string if missing)
history_titles = [news_dict.get(nid, "") for nid in history_ids]

# Convert history titles to sequences using the tokenizer
history_sequences = tokenizer.texts_to_sequences(history_titles)
# Pad each sequence to max_title_length
history_padded = pad_sequences(history_sequences, maxlen=max_title_length, 
                               padding='post', truncating='post', value=0)

# Ensure the history has exactly max_history_length rows:
if history_padded.shape[0] < max_history_length:
    # Pre-pad with zeros if there are fewer history items
    pad_rows = np.zeros((max_history_length - history_padded.shape[0], max_title_length), dtype=int)
    history_padded = np.vstack([pad_rows, history_padded])
else:
    # If too many, take the last max_history_length items
    history_padded = history_padded[-max_history_length:]

# Process candidate: the "Impressions" column is a space-separated list like "newsID-label newsID-label ..."
impressions = sample['Impressions']
first_candidate = impressions.split()[0]  # take the first candidate
candidate_news_id = first_candidate.split('-')[0]
candidate_title = news_dict.get(candidate_news_id, "")
candidate_sequence = tokenizer.texts_to_sequences([candidate_title])
candidate_padded = pad_sequences(candidate_sequence, maxlen=max_title_length, 
                                 padding='post', truncating='post', value=0)[0]

# Convert to TensorFlow tensors
history_tensor = tf.convert_to_tensor([history_padded], dtype=tf.int32)  # shape: (1, max_history_length, max_title_length)
candidate_tensor = tf.convert_to_tensor([candidate_padded], dtype=tf.int32)  # shape: (1, max_title_length)

print("History tensor shape:", history_tensor.shape)
print("Candidate tensor shape:", candidate_tensor.shape)

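# Load ensemble models; get_models() returns (models_dict, news_df, behaviors_df, tokenizer),
# so unpack only the model dictionary and keep the data loaded above.
print("Loading models...")
models_dict, _, _, _ = get_models()
print("Models loaded:", list(models_dict.keys()))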

# Test ensemble bagging
bagging_pred = ensemble_bagging(history_tensor, candidate_tensor, models_dict)
print("Ensemble Bagging Prediction:", bagging_pred)

# Test ensemble boosting with dummy error values
dummy_errors = np.array([0.2, 0.15, 0.25])
boosting_pred = ensemble_boosting(history_tensor, candidate_tensor, models_dict, dummy_errors)
print("Ensemble Boosting Prediction:", boosting_pred)

# Test ensemble stacking with dummy training data
X_train_dummy = np.array([
    [0.80, 0.75, 0.85],
    [0.55, 0.60, 0.50],
    [0.30, 0.35, 0.25],
    [0.20, 0.25, 0.15]
])
y_train_dummy = np.array([1, 0, 1, 0])
meta_model = train_stacking_meta_model(X_train_dummy, y_train_dummy)
stacking_pred = ensemble_stacking(history_tensor, candidate_tensor, models_dict, meta_model)
print("Ensemble Stacking Prediction:", stacking_pred)

# Test hybrid ensemble
hybrid_pred = hybrid_ensemble(history_tensor, candidate_tensor, models_dict, dummy_errors, meta_model)
print("Hybrid Ensemble Prediction:", hybrid_pred)

### Next Steps

You can now develop and test your recommendation functions independently of the FastAPI backend. 

For further debugging, consider adding print statements or assertions inside the recommender functions.
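
As one possible starting point for such assertions, the sketch below checks input shapes and prediction values before and after an ensemble call. The helpers `check_model_inputs` and `check_prediction` are hypothetical; the expected shapes come from `max_history_length = 50` and `max_title_length = 30` used earlier in this notebook, and the [0, 1] range assumes the models output click probabilities.

```python
import numpy as np

def check_model_inputs(history, candidate, max_history_length=50, max_title_length=30):
    """Assert that history and candidate batches have the shapes the models expect."""
    history = np.asarray(history)
    candidate = np.asarray(candidate)
    assert history.ndim == 3, f"history must be 3-D, got shape {history.shape}"
    assert history.shape[1:] == (max_history_length, max_title_length), history.shape
    assert candidate.shape[1:] == (max_title_length,), candidate.shape
    assert history.shape[0] == candidate.shape[0], "batch sizes differ"
    assert np.issubdtype(history.dtype, np.integer), "token IDs must be integers"

def check_prediction(pred):
    """Assert that predictions are finite and lie in [0, 1] (assumes probabilities)."""
    pred = np.asarray(pred, dtype=float)
    assert np.all(np.isfinite(pred)), "prediction contains NaN or inf"
    assert np.all((pred >= 0.0) & (pred <= 1.0)), "prediction outside [0, 1]"

# Usage with placeholder arrays shaped like history_tensor and candidate_tensor.
check_model_inputs(np.zeros((1, 50, 30), dtype=np.int32), np.zeros((1, 30), dtype=np.int32))
check_prediction([0.42])
print("All checks passed.")
```

Calling these helpers at the top and bottom of each ensemble function would surface shape or range problems at the point where they occur.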