# Proof-of-Concept: Simple Article Recommendation - Model Development

This notebook demonstrates a basic prototype implementation of a content recommendation system for the NutriGenius proof-of-concept.

## Table of Contents
1. [Introduction](#introduction)
2. [Setup](#setup)
3. [Loading Processed Data](#loading-data)
4. [Feature Engineering](#feature-engineering)
5. [Model Development](#model-development)
6. [Model Evaluation](#evaluation)
7. [Personalization Features](#personalization)
8. [Model Export](#export)
9. [Conclusion](#conclusion)

## 1. Introduction

This notebook implements a simplified article recommendation prototype for the NutriGenius proof-of-concept. We intentionally use basic approaches that:

1. Require minimal data preparation (works with our synthetic dataset)
2. Implement classical TF-IDF and LSA rather than complex neural approaches
3. Provide simple but effective content-based recommendations
4. Can be easily integrated into a prototype mobile application

The goal is to demonstrate that even with these simplifications, we can deliver reasonable article recommendations based on user characteristics and interests. This approach is well-suited for a proof-of-concept while requiring minimal development time and resources.

> **Note**: This prototype intentionally uses simplified algorithms instead of state-of-the-art recommendation techniques to focus on demonstrating the concept with minimal complexity.

## 2. Setup

In [None]:
import os
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
import re
from tqdm.notebook import tqdm
import joblib
import random
from collections import Counter

# Set random seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Configure plots
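# The seaborn styles were renamed in matplotlib 3.6; fall back to the old name on older versions
try:
    plt.style.use('seaborn-v0_8-whitegrid')
except OSError:
    plt.style.use('seaborn-whitegrid')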
sns.set_context('notebook')

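# Add project root to path. The string literal "__file__" has no directory part,
# so this resolves to two levels above the notebook's working directory.
sys.path.append(os.path.abspath(os.path.join(os.path.dirname("__file__"), '../..')))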

# Import utility functions
from src.utils.common import load_config, create_directory
from src.utils.data_processing import clean_text, tokenize_text
from src.article_recommender import build_recommendation_model, get_recommendations

In [None]:
# Load configuration
CONFIG_PATH = os.path.abspath(os.path.join(os.path.dirname("__file__"), '../../config/model_config.yaml'))
config = load_config(CONFIG_PATH)

# Extract relevant configuration
article_config = config['article_recommender']
dataset_config = config['dataset']['articles']
model_path_config = config['model_paths']['article_recommender']

# Define paths from config
ARTICLES_DATA_FILE = dataset_config['data_file']
ARTICLES_PROCESSED_FILE = dataset_config['processed_file']
RECOMMENDER_MODEL_PATH = model_path_config['model']

# Create necessary directories
create_directory(os.path.dirname(ARTICLES_DATA_FILE))
create_directory(os.path.dirname(ARTICLES_PROCESSED_FILE))
create_directory(os.path.dirname(RECOMMENDER_MODEL_PATH))

## 3. Loading Processed Data

First, let's load the processed article data from the EDA notebook:

In [None]:
# Check if processed data exists
if os.path.exists(ARTICLES_PROCESSED_FILE):
    print(f"Loading processed articles from {ARTICLES_PROCESSED_FILE}")
    articles_df = pd.read_csv(ARTICLES_PROCESSED_FILE)
else:
    # If processed data doesn't exist, load raw data
    if os.path.exists(ARTICLES_DATA_FILE):
        print(f"Loading raw articles from {ARTICLES_DATA_FILE}")
        articles_df = pd.read_csv(ARTICLES_DATA_FILE)
        
        # Text preprocessing (as in the EDA notebook) is applied in the next cell
        # when the processed columns are missing
    else:
        # If neither exists, create sample dataset
        from src.article_recommender import create_sample_article_dataset
        print("Creating sample article dataset")
        create_sample_article_dataset(ARTICLES_DATA_FILE, num_articles=200)
        articles_df = pd.read_csv(ARTICLES_DATA_FILE)
        
        # Text preprocessing (as in the EDA notebook) is applied in the next cell
        # when the processed columns are missing

print(f"Loaded dataset with {len(articles_df)} articles")
articles_df.head()

In [None]:
# WordNetLemmatizer is used below but is not part of the Setup imports
from nltk.stem import WordNetLemmatizer

# Fetch the NLTK resources used below if they are missing
# ('punkt_tab' is only needed by newer NLTK releases)
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

# Defined outside the `if` below because get_query_recommendations
# also calls it to preprocess user queries
def preprocess_text(text, remove_stopwords=True, lemmatize=True):
    # Convert to lowercase
    text = text.lower()

    # Tokenize
    tokens = word_tokenize(text)

    # Remove stopwords
    if remove_stopwords:
        stop_words = set(stopwords.words('english'))
        tokens = [token for token in tokens if token not in stop_words]

    # Lemmatization
    if lemmatize:
        lemmatizer = WordNetLemmatizer()
        tokens = [lemmatizer.lemmatize(token) for token in tokens]

    # Rejoin into string
    return ' '.join(tokens)

# Check if we have the processed text columns already
if 'processed_content' not in articles_df.columns:
    print("Processing text data...")

    # Apply preprocessing to article contents
    articles_df['processed_content'] = articles_df['content'].apply(
        lambda x: preprocess_text(x, remove_stopwords=True, lemmatize=True)
    )

    # Process article titles
    articles_df['processed_title'] = articles_df['title'].apply(
        lambda x: preprocess_text(x, remove_stopwords=True, lemmatize=True)
    )

    # Process article tags (stopwords are kept because tags are short keyword lists)
    articles_df['processed_tags'] = articles_df['tags'].apply(
        lambda x: preprocess_text(x.replace(',', ' '), remove_stopwords=False, lemmatize=True)
    )

    # Combine all processed text fields, with the category as a single token
    articles_df['combined_text'] = (
        articles_df['processed_title'] + ' ' +
        articles_df['processed_content'] + ' ' +
        articles_df['processed_tags'] + ' ' +
        articles_df['category'].apply(lambda x: x.lower().replace(' ', '_'))
    )

## 4. Feature Engineering

Let's create numerical representations of our text data: first a sparse TF-IDF matrix, then a lower-dimensional LSA projection of it.

In [None]:
# Initialize TF-IDF vectorizer with parameters from config
tfidf_vectorizer = TfidfVectorizer(
    max_features=article_config['max_features'],
    lowercase=article_config['preprocessing']['lowercase'],
    stop_words='english' if article_config['preprocessing']['remove_stopwords'] else None
)

# Fit and transform the combined text
tfidf_matrix = tfidf_vectorizer.fit_transform(articles_df['combined_text'])

print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"Number of features: {len(tfidf_vectorizer.get_feature_names_out())}")

In [None]:
# Apply dimensionality reduction using LSA (Latent Semantic Analysis)
n_components = min(100, tfidf_matrix.shape[1] - 1)
svd = TruncatedSVD(n_components=n_components, random_state=42)
lsa_matrix = svd.fit_transform(tfidf_matrix)

print(f"LSA matrix shape: {lsa_matrix.shape}")
print(f"Explained variance ratio: {svd.explained_variance_ratio_.sum():.4f}")

# Plot the explained variance
plt.figure(figsize=(10, 5))
plt.plot(range(1, n_components + 1), svd.explained_variance_ratio_.cumsum())
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Explained Variance by LSA Components')
plt.grid(True)
plt.show()

## 5. Model Development

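Now, let's develop our recommendation model. It is a cosine-similarity lookup over the article vectors, exposed through three functions: article-to-article, query-based, and personalized recommendations.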
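
The core lookup can be sketched on hand-made vectors. This is an illustrative variant, not the notebook's function: `toy_vectors` and `top_n_similar` are hypothetical names, and it uses `np.argsort` instead of sorting a list of tuples.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hand-made 2-D "article vectors" (illustrative, not real LSA output)
toy_vectors = np.array([
    [1.0, 0.0],   # article 0
    [0.9, 0.1],   # article 1: points almost the same way as article 0
    [0.0, 1.0],   # article 2: orthogonal to article 0
    [0.7, 0.7],   # article 3: halfway between
])

# Pairwise cosine similarity, shape (4, 4)
toy_sim = cosine_similarity(toy_vectors)

def top_n_similar(index, similarity_matrix, top_n=2):
    """Return indices of the top_n most similar rows, excluding `index` itself."""
    scores = similarity_matrix[index].copy()
    scores[index] = -np.inf                  # never recommend the article itself
    return np.argsort(scores)[::-1][:top_n]  # highest similarity first

print(top_n_similar(0, toy_sim))  # articles 1 and 3, in that order
```

Article 1 ranks first because it points in nearly the same direction as article 0, and the orthogonal article 2 is left out of the top two.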

In [None]:
# Compute similarity matrix using LSA features
lsa_similarity = cosine_similarity(lsa_matrix)
print(f"LSA similarity matrix shape: {lsa_similarity.shape}")

# Also keep the TF-IDF similarity for comparison
tfidf_similarity = cosine_similarity(tfidf_matrix)

In [None]:
# Function to get article recommendations based on article index
def get_recommendations_by_index(article_index, similarity_matrix, df, top_n=5):
    # Get similarity scores for the article
    sim_scores = list(enumerate(similarity_matrix[article_index]))
    
    # Remove the article itself
    sim_scores = [score for score in sim_scores if score[0] != article_index]
    
    # Sort based on similarity score
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Get top N similar articles
    sim_scores = sim_scores[:top_n]
    
    # Get article indices and scores
    article_indices = [i[0] for i in sim_scores]
    similarity_scores = [i[1] for i in sim_scores]
    
    # Create a DataFrame with recommended articles
    recommendations = df.iloc[article_indices][['article_id', 'title', 'category', 'tags']].copy()
    recommendations['similarity'] = similarity_scores
    
    return recommendations

In [None]:
# Function to get query-based recommendations
def get_query_recommendations(query, vectorizer, matrix, similarity_type="tfidf", df=articles_df, top_n=5):
    # Preprocess the query
    processed_query = preprocess_text(query, remove_stopwords=True, lemmatize=True)
    
    # Transform the query using the same vectorizer
    query_vector = vectorizer.transform([processed_query])
    
    # Project the query into LSA space if requested; otherwise stay in TF-IDF space.
    # Note: the LSA branch uses the notebook-level `svd` and `lsa_matrix` objects.
    if similarity_type == "lsa":
        query_vector = svd.transform(query_vector)
        sim_matrix = lsa_matrix
    else:
        sim_matrix = matrix
    
    # Calculate cosine similarities between the query and every article
    cosine_similarities = cosine_similarity(query_vector, sim_matrix).flatten()
    
    # Get top N similar articles
    sim_scores = list(enumerate(cosine_similarities))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[:top_n]
    
    # Get article indices and scores
    article_indices = [i[0] for i in sim_scores]
    similarity_scores = [i[1] for i in sim_scores]
    
    # Create a DataFrame with recommended articles
    recommendations = df.iloc[article_indices][['article_id', 'title', 'category', 'tags']].copy()
    recommendations['similarity'] = similarity_scores
    
    return recommendations

In [None]:
# Enhanced function for personalized recommendations
def get_personalized_recommendations(user_profile, food_items=None, health_status=None, 
                                     similarity_type="lsa", top_n=5):
    # Build query from user profile and interests
    query_parts = []
    
    # Add age-related terms
    if 'age' in user_profile:
        age = user_profile['age']
        if age < 18:
            query_parts.append("nutrition for children teenagers youth")
        elif age < 30:
            query_parts.append("nutrition for young adults")
        elif age < 50:
            query_parts.append("nutrition for adults middle age")
        else:
            query_parts.append("nutrition for seniors elderly older adults")
    
    # Add gender-related terms
    if 'gender' in user_profile:
        gender = user_profile['gender'].lower()
        if gender == 'male':
            query_parts.append("men's nutrition male health")
        elif gender == 'female':
            query_parts.append("women's nutrition female health")
    
    # Add food items
    if food_items:
        food_query = " ".join(food_items)
        query_parts.append(f"nutrition {food_query} recipes diet")
    
    # Add health status terms
    if health_status:
        # BMI-related recommendations
        if 'bmi' in health_status:
            bmi = health_status['bmi']
            if bmi < 18.5:
                query_parts.append("underweight nutrition gain weight healthy calories protein")
            elif bmi < 25:
                query_parts.append("healthy weight maintenance balanced diet")
            elif bmi < 30:
                query_parts.append("overweight nutrition weight management reducing calories")
            else:
                query_parts.append("obesity weight loss diet plan calorie deficit")
        
        # Add other health conditions
        if 'conditions' in health_status:
            conditions = " ".join(health_status['conditions'])
            query_parts.append(f"nutrition for {conditions} diet health")
    
    # Combine all query parts
    combined_query = " ".join(query_parts)
    
    # Get recommendations based on the combined query
    # (any similarity_type other than "lsa" is treated as TF-IDF downstream)
    recommendations = get_query_recommendations(
        combined_query, tfidf_vectorizer, tfidf_matrix, similarity_type, top_n=top_n
    )
    return recommendations, combined_query

## 6. Model Evaluation

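Let's evaluate our recommendation system. The evaluation is qualitative: we print TF-IDF and LSA recommendations side by side, first for a random article and then for a set of test queries.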
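
One way to attach a number to these side-by-side comparisons is a precision-at-k score that treats "same category as the source article" as a stand-in for relevance. That proxy is an assumption made for this sketch, not something the notebook measures, and `category_precision_at_k` is a hypothetical helper.

```python
def category_precision_at_k(source_category, recommended_categories, k=5):
    """Fraction of the top-k recommendations that share the source article's category."""
    top_k = list(recommended_categories)[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for category in top_k if category == source_category)
    return hits / len(top_k)

# Hypothetical example: 3 of the 5 recommendations share the source category
example_score = category_precision_at_k(
    "Weight Management",
    ["Weight Management", "Heart Health", "Weight Management",
     "Weight Management", "Diabetes"],
)
print(example_score)  # 0.6
```

In this notebook the helper could be called as `category_precision_at_k(article['category'], lsa_recommendations['category'])` and averaged over many source articles to compare TF-IDF against LSA.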

In [None]:
# Compare TF-IDF vs LSA recommendations for a random article
random_article_idx = np.random.randint(0, len(articles_df))
article = articles_df.iloc[random_article_idx]

print(f"Selected article (ID: {article['article_id']}):\n{article['title']}")
print(f"Category: {article['category']}")
print(f"Tags: {article['tags']}")
print()

# Get TF-IDF recommendations
tfidf_recommendations = get_recommendations_by_index(random_article_idx, tfidf_similarity, articles_df)

print("TF-IDF Recommendations:")
for i, (_, row) in enumerate(tfidf_recommendations.iterrows()):
    print(f"{i+1}. {row['title']} (Similarity: {row['similarity']:.2f})")
    print(f"   Category: {row['category']}")
    print(f"   Tags: {row['tags']}")
    print()

# Get LSA recommendations
lsa_recommendations = get_recommendations_by_index(random_article_idx, lsa_similarity, articles_df)

print("LSA Recommendations:")
for i, (_, row) in enumerate(lsa_recommendations.iterrows()):
    print(f"{i+1}. {row['title']} (Similarity: {row['similarity']:.2f})")
    print(f"   Category: {row['category']}")
    print(f"   Tags: {row['tags']}")
    print()

In [None]:
# Evaluate query-based recommendations
test_queries = [
    "protein rich foods for muscle building",
    "low calorie foods for weight loss",
    "nutrition for pregnant women",
    "managing diabetes through diet",
    "heart healthy diet"
]

# Compare TF-IDF vs LSA for query-based recommendations
for query in test_queries:
    print(f"Query: {query}")
    print("-" * 40)
    
    # Get TF-IDF recommendations
    tfidf_query_recs = get_query_recommendations(query, tfidf_vectorizer, tfidf_matrix, "tfidf", top_n=3)
    
    print("TF-IDF Recommendations:")
    for i, (_, row) in enumerate(tfidf_query_recs.iterrows()):
        print(f"{i+1}. {row['title']} (Similarity: {row['similarity']:.2f})")
        print(f"   Category: {row['category']}")
    
    # Get LSA recommendations
    lsa_query_recs = get_query_recommendations(query, tfidf_vectorizer, tfidf_matrix, "lsa", top_n=3)
    
    print("\nLSA Recommendations:")
    for i, (_, row) in enumerate(lsa_query_recs.iterrows()):
        print(f"{i+1}. {row['title']} (Similarity: {row['similarity']:.2f})")
        print(f"   Category: {row['category']}")
    
    print("\n" + "=" * 60 + "\n")

## 7. Personalization Features

Let's test the personalized recommendation features.
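
Personalization here works by turning profile attributes into query keywords. The BMI branch of `get_personalized_recommendations`, for instance, could also be written as a small lookup table. The sketch below reuses the thresholds and phrases from that function; `BMI_QUERY_TERMS`, `BMI_DEFAULT_TERMS`, and `bmi_query_terms` are hypothetical names introduced for illustration.

```python
# (upper bound, query terms) pairs, checked in order; values mirror the
# if/elif chain in get_personalized_recommendations
BMI_QUERY_TERMS = [
    (18.5, "underweight nutrition gain weight healthy calories protein"),
    (25.0, "healthy weight maintenance balanced diet"),
    (30.0, "overweight nutrition weight management reducing calories"),
]
BMI_DEFAULT_TERMS = "obesity weight loss diet plan calorie deficit"

def bmi_query_terms(bmi):
    """Return the query keywords for the first BMI band whose upper bound exceeds `bmi`."""
    for upper_bound, terms in BMI_QUERY_TERMS:
        if bmi < upper_bound:
            return terms
    return BMI_DEFAULT_TERMS

# BMI values taken from the test profiles in this section
for bmi in (21.5, 19.8, 29.2, 26.5):
    print(bmi, "->", bmi_query_terms(bmi))
```

A table like this keeps the thresholds in one place, which may make them easier to review or move into the config file later.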

In [None]:
# Test personalized recommendations with different user profiles
test_profiles = [
    {
        "name": "Teen Athlete",
        "profile": {"age": 16, "gender": "male"},
        "food_items": ["chicken", "rice", "broccoli", "protein shake"],
        "health_status": {"bmi": 21.5, "conditions": ["athletic performance"]}
    },
    {
        "name": "Young Adult Vegetarian",
        "profile": {"age": 25, "gender": "female"},
        "food_items": ["tofu", "beans", "spinach", "quinoa"],
        "health_status": {"bmi": 19.8, "conditions": ["vegetarian", "iron deficiency"]}
    },
    {
        "name": "Middle-aged with Diabetes",
        "profile": {"age": 48, "gender": "male"},
        "food_items": ["chicken breast", "brown rice", "salad"],
        "health_status": {"bmi": 29.2, "conditions": ["type 2 diabetes", "high blood pressure"]}
    },
    {
        "name": "Senior with Heart Condition",
        "profile": {"age": 72, "gender": "female"},
        "food_items": ["fish", "oatmeal", "blueberries", "walnuts"],
        "health_status": {"bmi": 26.5, "conditions": ["heart disease", "arthritis"]}
    }
]

# Test LSA personalized recommendations for each profile
for test in test_profiles:
    print(f"User: {test['name']}")
    print(f"Profile: {test['profile']}")
    print(f"Food Items: {test['food_items']}")
    print(f"Health Status: {test['health_status']}")
    print()
    
    # Get personalized recommendations
    recommendations, query = get_personalized_recommendations(
        test['profile'], test['food_items'], test['health_status'], 
        similarity_type="lsa", top_n=3
    )
    
    print(f"Generated Query: {query}")
    print("\nPersonalized Recommendations:")
    for i, (_, row) in enumerate(recommendations.iterrows()):
        print(f"{i+1}. {row['title']} (Similarity: {row['similarity']:.2f})")
        print(f"   Category: {row['category']}")
        print(f"   Tags: {row['tags']}")
        print()
    
    print("=" * 80)
    print()

## 8. Model Export

Now that we've built and evaluated our model, let's export it for use in the application.
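
Before relying on the exported file, it can help to confirm that a saved package loads back unchanged. The sketch below does a round trip with `joblib` on a toy package in a temporary directory; the package contents and the `toy_recommender.joblib` file name are placeholders, not the notebook's real artifacts.

```python
import os
import tempfile

import joblib
import numpy as np

# Toy stand-in for the model package (illustrative contents only)
toy_package = {
    "lsa_matrix": np.arange(6, dtype=float).reshape(3, 2),
    "article_ids": [101, 102, 103],
}

# Write to a temporary directory so nothing in the project is touched
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_recommender.joblib")
    joblib.dump(toy_package, toy_path)
    restored = joblib.load(toy_path)

# The restored arrays and lists should match what was saved
print(np.array_equal(restored["lsa_matrix"], toy_package["lsa_matrix"]))
print(restored["article_ids"] == toy_package["article_ids"])
```

The same load-and-compare check could be run against `RECOMMENDER_MODEL_PATH` after the export cell below.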

In [None]:
# Create a model package with all the necessary components
model_package = {
    "vectorizer": tfidf_vectorizer,
    "svd": svd,
    "lsa_matrix": lsa_matrix,
    "tfidf_matrix": tfidf_matrix,
    "articles": articles_df[['article_id', 'title', 'category', 'tags', 'content']].to_dict('records')
}

# Save the model package (joblib is imported in Setup; pickle is not)
joblib.dump(model_package, RECOMMENDER_MODEL_PATH)
print(f"Model package saved to {RECOMMENDER_MODEL_PATH}")

In [None]:
# Create a simplified version of the articles data for the app
import json

articles_data = []
for _, article in articles_df.iterrows():
    articles_data.append({
        "id": article['article_id'],
        "title": article['title'],
        "category": article['category'],
        "tags": article['tags'].split(", "),
        "summary": article['content'][:200] + "..." if len(article['content']) > 200 else article['content']
    })

# Save the articles data as JSON for the app.
# A separate file is used so the model package saved above is not overwritten;
# this path is an assumed location derived from the model path.
ARTICLES_JSON_PATH = os.path.splitext(RECOMMENDER_MODEL_PATH)[0] + '_articles.json'
with open(ARTICLES_JSON_PATH, 'w') as f:
    json.dump({"articles": articles_data}, f)
print(f"Articles data saved to {ARTICLES_JSON_PATH}")

## 9. Conclusion

In this notebook, we developed a prototype article recommendation system for the NutriGenius proof-of-concept. It generates personalized article recommendations from user demographics, detected food items, and health status by turning them into a keyword query over the article collection.

### Summary of achievements:
1. Built a TF-IDF vectorization model for article content representation
2. Applied LSA for dimensionality reduction and latent semantic understanding
3. Developed article-to-article recommendation functionality
4. Implemented query-based recommendation for user searches
5. Created personalized recommendation logic based on user profiles
6. Evaluated and compared different recommendation approaches
7. Exported the model for use in the mobile application

### Performance insights:
These observations come from reading the printed recommendations, not from a quantitative metric:
- LSA recommendations tend to capture more semantic relationships between articles
- TF-IDF recommendations are often more focused on keyword matching
- Personalized recommendations appear to prioritize content relevant to the test user profiles
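
To check how differently the two approaches actually behave, one could measure the overlap between their top-N lists for the same article or query. The sketch below uses Jaccard overlap on article IDs; `recommendation_overlap` is a hypothetical helper and the ID lists are made up.

```python
def recommendation_overlap(ids_a, ids_b):
    """Jaccard overlap between two lists of recommended article IDs (0.0 to 1.0)."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty lists are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

# Made-up ID lists: three shared articles out of seven distinct ones
overlap = recommendation_overlap([3, 7, 12, 18, 25], [7, 12, 30, 41, 3])
print(round(overlap, 3))
```

With the notebook's outputs this could be called as `recommendation_overlap(tfidf_recommendations['article_id'], lsa_recommendations['article_id'])`. A low average overlap would suggest the two representations surface noticeably different articles.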

### Next steps:
1. Integration with the mobile application
2. Collection of user feedback to improve recommendations
3. Exploration of hybrid recommendation approaches (content + collaborative filtering)
4. Regular updates to the article database 