# **Recommendation System**

## Summary

This notebook walks through the complete pipeline for building a workout recommendation system, from synthetic data generation to evaluation. It generates diverse user and workout data, preprocesses it for collaborative and content-based filtering, and implements a hybrid recommendation model. By combining collaborative filtering with content attributes such as demographics, the system targets personalization and the cold-start problem. The models are evaluated with RMSE, MAE, precision, and recall. Future improvements include testing on real-world data, exploring more advanced models, and preparing the system for deployment, with the aim of a scalable and robust solution.

In [1]:
pip install scikit-surprise


Collecting scikit-surprise
  Downloading scikit_surprise-1.1.4.tar.gz (154 kB)
     154.4/154.4 kB 2.9 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (pyproject.toml) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.4-cp311-cp311-linux_x86_64.whl size=2505187 sha256=3fb0e89fd2b2a7bcfdc74efa47fc688570a1c5a86c8b8fd06bfe97bf4e4cc6b0
  Stored in directory: /root/.cache/pip/wheels/2a/8f/6e/7e28991

In [2]:
# Importing required libraries
import numpy as np
import pandas as pd
import random
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics.pairwise import cosine_similarity
from surprise import Dataset, Reader, SVD
from surprise.model_selection import train_test_split, cross_validate
from surprise import accuracy

**Generating Synthetic Data**

Since real-world datasets were not provided, generating synthetic data was essential for this project. I chose metrics like calories burned, workout duration, heart rate, and VO2 Max because they reflect various aspects of fitness engagement and intensity. These metrics can help personalize recommendations based on user preferences and performance.

To ensure diversity in the dataset, I used random sampling for workout types and generated random values within realistic ranges for each metric. This helps simulate varied user profiles and workout patterns, making the dataset more robust for testing the recommendation system.

In [3]:
# Define parameters for synthetic data
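np.random.seed(42)  # Seeds NumPy's generator (used for the metric values)
random.seed(42)     # Seeds Python's random module (used by random.sample below)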
user_ids = [f"user_{i}" for i in range(1, 101)]  # 100 users
workout_types = ['Yoga', 'Cardio', 'Strength', 'HIIT', 'Pilates', 'Dance', 'CrossFit', 'Zumba', 'Cycling']

# Generate synthetic dataset
data = []
for user_id in user_ids:
    for workout in random.sample(workout_types, 4):  # Each user performs 4 random workouts
        calories = np.random.randint(150, 800)
        duration = np.random.randint(15, 120)
        heart_rate = np.random.randint(80, 180)
        vo2_max = round(np.random.uniform(30, 60), 2)
        data.append([user_id, workout, calories, duration, heart_rate, vo2_max])

# Convert data to DataFrame
df = pd.DataFrame(data, columns=[
    'User ID', 'Workout Type', 'Calories Burned', 'Workout Duration', 'Heart Rate', 'VO2 Max'
])

# Display the first few rows
print(df.head())


  User ID Workout Type  Calories Burned  Workout Duration  Heart Rate  VO2 Max
0  user_1        Dance              252                66         172    35.50
1  user_1        Zumba              221                75         100    34.68
2  user_1      Pilates              616               101         154    43.78
3  user_1         HIIT              522               114         103    49.53
4  user_2      Pilates              458                16         167    54.97


**Normalizing Data**

**Why Normalize?** Numeric features such as calories burned and workout duration are on different scales, so the features with larger ranges could dominate similarity calculations or model training. Normalizing them to a common range ensures that all metrics contribute equally to the recommendations.
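As a minimal sketch of what min-max scaling does, the snippet below applies the formula `(x - min) / (max - min)` column by column. The toy values are invented for illustration and are not taken from the dataset. `MinMaxScaler` with its default `feature_range=(0, 1)` applies the same formula.

```python
import numpy as np

# Toy values (hypothetical): column 0 ~ calories burned, column 1 ~ duration in minutes
toy = np.array([[150.0, 15.0],
                [475.0, 60.0],
                [800.0, 120.0]])

# Min-max scaling, applied independently to each column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
scaled = (toy - col_min) / (col_max - col_min)

# After scaling, every column spans exactly [0, 1]
print(scaled)
```

Both columns now span the same range, so neither dominates a distance or similarity calculation simply because of its units.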

In [4]:
# Normalize numeric features
scaler = MinMaxScaler()
df[['Calories Burned', 'Workout Duration', 'Heart Rate', 'VO2 Max']] = scaler.fit_transform(
    df[['Calories Burned', 'Workout Duration', 'Heart Rate', 'VO2 Max']]
)

# Encode workout types as numeric categories
df['Workout Type'] = df['Workout Type'].astype('category').cat.codes

# Display data summary
print(df.describe())


       Workout Type  Calories Burned  Workout Duration  Heart Rate     VO2 Max
count    400.000000       400.000000        400.000000  400.000000  400.000000
mean       3.665000         0.476283          0.511490    0.488061    0.527324
std        2.514674         0.290773          0.291390    0.302032    0.299362
min        0.000000         0.000000          0.000000    0.000000    0.000000
25%        2.000000         0.221705          0.278846    0.224490    0.262719
50%        4.000000         0.493798          0.548077    0.469388    0.540768
75%        6.000000         0.722481          0.750000    0.765306    0.809552
max        8.000000         1.000000          1.000000    1.000000    1.000000
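
Because the encoding step replaces the workout names with integers, a lookup table can help when reading later output. The sketch below assumes the codes follow the sorted order of the category labels, which is the pandas default for unordered string categories. The names `code_to_name` and `name_to_code` are illustrative and not part of the notebook.

```python
import pandas as pd

workout_types = ['Yoga', 'Cardio', 'Strength', 'HIIT', 'Pilates',
                 'Dance', 'CrossFit', 'Zumba', 'Cycling']

# Build a categorical series the same way the notebook encodes 'Workout Type'
encoded = pd.Series(workout_types).astype('category')

# cat.categories holds the labels in code order: position i is code i
code_to_name = dict(enumerate(encoded.cat.categories))
name_to_code = {name: code for code, name in code_to_name.items()}

for code, name in code_to_name.items():
    print(code, name)
```

Building the lookup before overwriting the column keeps the original labels recoverable.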


**Data Preparation**

I used calories burned as the target rating for the recommendation system because it is a key indicator of workout effectiveness and aligns with the goal of recommending workouts that suit user needs.

The choice to use collaborative filtering was driven by its ability to leverage user-item interactions. While content-based methods could be explored, starting with collaborative filtering allows us to test the fundamental recommendation process.

In [5]:
# Define the rating matrix (use Calories Burned as the target rating)
rating_data = df[['User ID', 'Workout Type', 'Calories Burned']]

# Prepare dataset for Surprise library
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(rating_data, reader)


In [6]:
# Split data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# trainset.all_ratings() is a generator, so len() does not work on it;
# the Trainset object exposes the count directly as n_ratings
trainset_ratings_count = trainset.n_ratings
testset_ratings_count = len(testset)

print(f"Training set size: {trainset_ratings_count}")
print(f"Test set size: {testset_ratings_count}")



Training set size: 320
Test set size: 80


**Training the SVD Model**

**Why SVD?** Singular Value Decomposition (SVD) is a popular algorithm for collaborative filtering. The Surprise `SVD` class implements the matrix-factorization variant popularized by Simon Funk during the Netflix Prize: it learns latent factor vectors for users and items from the observed ratings, capturing hidden patterns in user preferences and workout attributes.

While SVD is effective for sparse matrices, it may not capture contextual features like time of day or user demographics. These limitations will be addressed in the hybrid approach later.
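
To make the latent-factor idea concrete, here is a small NumPy sketch of the prediction rule used by biased matrix factorization: global mean plus user bias plus item bias plus the dot product of the user and item factor vectors. All parameter values below are random placeholders, not learned weights, and the names (`mu`, `b_u`, `b_i`, `P`, `Q`) are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 3, 2

# Placeholder parameters; a trained model would learn these from the ratings
mu = 0.5                                       # global mean rating
b_u = rng.normal(0, 0.05, n_users)             # per-user bias
b_i = rng.normal(0, 0.05, n_items)             # per-item bias
P = rng.normal(0, 0.1, (n_users, n_factors))   # user latent factors
Q = rng.normal(0, 0.1, (n_items, n_factors))   # item latent factors

def predict(u, i, low=0.0, high=1.0):
    """Estimate the rating of user u for item i, clipped to the rating scale."""
    estimate = mu + b_u[u] + b_i[i] + Q[i] @ P[u]
    return float(np.clip(estimate, low, high))

# The same rule evaluated for every user-item pair at once
all_estimates = np.clip(mu + b_u[:, None] + b_i[None, :] + P @ Q.T, 0.0, 1.0)

print(predict(0, 1))
```

Training adjusts the biases and factor vectors to reduce the squared error on observed ratings. The sketch covers only the prediction step.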

In [7]:
# Train the SVD model
svd = SVD()
svd.fit(trainset)


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7c46d4ffd450>

**Model Evaluation**

To ensure a comprehensive evaluation, I used RMSE and MAE for accuracy and precision/recall for relevance. These metrics provide insights into both prediction quality and recommendation effectiveness.
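
For reference, both accuracy metrics can be computed by hand. The ratings below are hypothetical values on the 0 to 1 scale, chosen only to show the arithmetic.

```python
import numpy as np

# Hypothetical true and predicted ratings on the 0-1 scale
actual = np.array([0.20, 0.50, 0.90, 0.40])
predicted = np.array([0.30, 0.40, 0.60, 0.40])

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))   # squares errors, so large misses weigh more
mae = np.mean(np.abs(errors))          # plain average of absolute errors

print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```

RMSE is never smaller than MAE on the same predictions, and a wide gap between the two suggests a few large errors.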

To validate the model's robustness, I used 5-fold cross-validation. This method helps identify performance consistency across different data splits.

Precision and recall trade-offs are crucial in recommendation systems. For example, high precision might result in fewer but highly relevant recommendations, while high recall might cover more relevant items but include noise.
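
The trade-off can be seen by varying the length of the recommendation list. The helper below is an illustrative sketch. The function name, the ranked list, and the relevant set are all made up for the example.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Return (precision, recall) for the first k items of a ranked list."""
    top_k = recommended[:k]
    hits = len(set(top_k) & set(relevant))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ranked recommendations and the workouts the user actually liked
ranked = ['HIIT', 'Yoga', 'Cycling', 'Dance', 'Zumba']
relevant = ['Yoga', 'Zumba']

for k in (1, 3, 5):
    p, r = precision_recall_at_k(ranked, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

In this example recall rises from 0.00 to 0.50 to 1.00 as `k` grows, while precision moves from 0.00 to 0.33 to 0.40: a longer list finds more of the relevant items but also carries more irrelevant ones.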

In [8]:
# Cross-validate the model
cv_results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)


# Ranking-based evaluation for the SVD model

# Generate ground truth for evaluation (random is already imported above)
ground_truth = {user_id: random.sample(df[df['User ID'] == user_id]['Workout Type'].tolist(), 2)
                for user_id in user_ids if len(df[df['User ID'] == user_id]['Workout Type']) >= 2}

# Function to predict and rank top-N recommendations
def recommend_top_n_svd(user_id, num_recommendations=3):
    predicted_ratings = {workout: svd.predict(user_id, workout).est for workout in df['Workout Type'].unique()}
    ranked_workouts = sorted(predicted_ratings, key=predicted_ratings.get, reverse=True)
    return ranked_workouts[:num_recommendations]

# Evaluate precision and recall
precision_scores = []
recall_scores = []

for user_id, actual_workouts in ground_truth.items():
    recommended_workouts = recommend_top_n_svd(user_id, num_recommendations=3)
    true_positives = len(set(recommended_workouts) & set(actual_workouts))
    precision = true_positives / len(recommended_workouts) if recommended_workouts else 0
    recall = true_positives / len(actual_workouts) if actual_workouts else 0

    precision_scores.append(precision)
    recall_scores.append(recall)

# Calculate average precision and recall
average_precision = sum(precision_scores) / len(precision_scores)
average_recall = sum(recall_scores) / len(recall_scores)

print(f"Baseline Model - Average Precision: {average_precision:.2f}")
print(f"Baseline Model - Average Recall: {average_recall:.2f}")

# Split data into training and test sets
trainset, testset = train_test_split(data, test_size=0.2)

# Train the SVD model
svd = SVD()
svd.fit(trainset)

# Evaluate the model using cross-validation
print("\nCross-Validation Results:")
cv_results = cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# Evaluate on the test set
predictions = svd.test(testset)

# Calculate performance metrics
rmse = accuracy.rmse(predictions)
mae = accuracy.mae(predictions)

print(f"RMSE: {rmse}")
print(f"MAE: {mae}")

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.3114  0.3080  0.3320  0.2994  0.3075  0.3117  0.0109  
MAE (testset)     0.2565  0.2585  0.2825  0.2487  0.2541  0.2601  0.0117  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Baseline Model - Average Precision: 0.20
Baseline Model - Average Recall: 0.30

Cross-Validation Results:
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.3153  0.2936  0.3216  0.3027  0.3287  0.3124  0.0127  
MAE (testset)     0.2651  0.2481  0.2726  0.2525  0.2806  0.2638  0.0122  
Fit time          0.00    0.00    0.00    0.00    0.00    0.00    0.00    
Test time         0.00    0.00    0.00    0.00    0.00    0.00    0.00    
RMSE: 0.2484
MAE:  0.2007
RMSE: 0.24

In [9]:
# Predict for a specific user and workout
user_id = "user_10"
workout_type = 3  # Encoded value for 'Dance' (cat.codes follow alphabetical order; 'HIIT' is 4)

predicted_rating = svd.predict(user_id, workout_type)
print(f"Predicted Rating for {user_id} on workout {workout_type}: {predicted_rating.est:.2f}")


Predicted Rating for user_10 on workout 3: 0.30


**Extending the Dataset by Incorporating Demographics and Contextual Features**

In [10]:
# Add demographic and contextual features
ages = np.random.randint(18, 65, len(user_ids))  # Ages 18 to 64 (the upper bound is exclusive)
genders = [random.choice(['Male', 'Female', 'Non-binary']) for _ in user_ids]
fitness_levels = [random.choice(['Beginner', 'Intermediate', 'Advanced']) for _ in user_ids]
time_of_day = ['Morning', 'Afternoon', 'Evening']

# Map fitness levels to numeric categories
fitness_level_mapping = {'Beginner': 1, 'Intermediate': 2, 'Advanced': 3}

# Extend synthetic data generation with demographics
extended_data = []
for i, user_id in enumerate(user_ids):
    for workout in random.sample(workout_types, 4):
        calories = np.random.randint(150, 800)
        duration = np.random.randint(15, 120)
        heart_rate = np.random.randint(80, 180)
        vo2_max = round(np.random.uniform(30, 60), 2)
        time = random.choice(time_of_day)
        extended_data.append([
            user_id, workout, calories, duration, heart_rate, vo2_max,
            ages[i], genders[i], fitness_level_mapping[fitness_levels[i]], time
        ])

# Create DataFrame for extended data
df_extended = pd.DataFrame(extended_data, columns=[
    'User ID', 'Workout Type', 'Calories Burned', 'Workout Duration', 'Heart Rate', 'VO2 Max',
    'Age', 'Gender', 'Fitness Level', 'Time of Day'
])

# Convert categorical features to numeric
df_extended['Gender'] = df_extended['Gender'].astype('category').cat.codes
df_extended['Time of Day'] = df_extended['Time of Day'].astype('category').cat.codes

print(df_extended.head())


  User ID Workout Type  Calories Burned  Workout Duration  Heart Rate  \
0  user_1     CrossFit              624                64         175   
1  user_1     Strength              202                76         134   
2  user_1        Dance              162                15         166   
3  user_1        Zumba              512               112         111   
4  user_2         HIIT              204                47          83   

   VO2 Max  Age  Gender  Fitness Level  Time of Day  
0    44.21   59       0              2            1  
1    46.95   59       0              2            0  
2    55.40   59       0              2            0  
3    33.89   59       0              2            1  
4    51.21   18       0              1            0  


**Hybrid Recommendation System with Scikit-Learn**

**Why Combine Methods?** Collaborative filtering excels at capturing user-item interactions, while content-based filtering leverages user and item attributes. By combining these, the hybrid approach provides more personalized recommendations.

In this implementation, I calculated user similarity based on collaborative filtering using a user-item interaction matrix. For the content-based filtering, I used features like demographics (e.g., age, gender, fitness level) and contextual data (e.g., time of day). By averaging the similarities from both approaches, I created a hybrid similarity matrix that balances interaction-based patterns with attribute-based insights.

The hybrid approach improves recommendation relevance by addressing the cold-start problem for new users (via content attributes) while maintaining personalized recommendations based on past interactions.
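
The averaging step is a special case of a weighted blend. The sketch below generalizes it with a hypothetical weight `alpha`, which is not a parameter of this notebook, and uses small invented matrices. It assumes both inputs list the same users in the same row order, which the blend needs to be meaningful.

```python
import numpy as np

def cosine_sim(features):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    features = np.asarray(features, dtype=float)
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # avoid division by zero for all-zero rows
    unit = features / norms
    return unit @ unit.T

def blend_similarities(collab_sim, content_sim, alpha=0.5):
    """Weighted blend; alpha=0.5 reproduces a plain average."""
    if collab_sim.shape != content_sim.shape:
        raise ValueError("similarity matrices must have the same shape")
    return alpha * collab_sim + (1 - alpha) * content_sim

# Toy data for three users (hypothetical values), rows in the same user order
interactions = np.array([[0.9, 0.0, 0.4],
                         [0.8, 0.1, 0.0],
                         [0.0, 0.7, 0.6]])
attributes = np.array([[0.2, 1.0],
                       [0.9, 0.0],
                       [0.3, 1.0]])

collab = cosine_sim(interactions)
content = cosine_sim(attributes)
hybrid = blend_similarities(collab, content, alpha=0.5)
print(np.round(hybrid, 3))
```

A lower `alpha` leans on content attributes, which is the useful direction for new users with little interaction history. A higher `alpha` leans on observed behavior.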

In [11]:
# Normalize numeric features
df_extended[['Calories Burned', 'Workout Duration', 'Heart Rate', 'VO2 Max', 'Age', 'Fitness Level']] = \
    scaler.fit_transform(df_extended[['Calories Burned', 'Workout Duration', 'Heart Rate', 'VO2 Max', 'Age', 'Fitness Level']])


**Collaborative Filtering Matrix**

In [12]:
# User-item interaction matrix
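interaction_matrix = pd.pivot_table(df_extended, index='User ID', columns='Workout Type', values='Calories Burned').fillna(0)
# pivot_table sorts the index as strings (user_1, user_10, user_100, ...);
# reindex so the rows follow user_ids and line up with the content-based matrix
interaction_matrix = interaction_matrix.reindex(user_ids)
user_similarity = cosine_similarity(interaction_matrix)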


**Content-Based Matrix**

In [13]:
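# Content-based feature matrix (one row per user, in order of first appearance).
# drop_duplicates keeps each user's first row, so 'Time of Day' reflects only
# that user's first logged workout.
content_features = df_extended[['User ID', 'Age', 'Gender', 'Fitness Level', 'Time of Day']].drop_duplicates(subset='User ID').set_index('User ID')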
content_similarity = cosine_similarity(content_features)


**Hybrid Similarity**

In [14]:
# Combine collaborative and content-based similarities
hybrid_similarity = (user_similarity + content_similarity) / 2


**Recommendation Function**

In [15]:
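def recommend_workouts(user_id, num_recommendations=3):
    # Rank all users by hybrid similarity, most similar first. The user's own
    # row normally ranks first (self-similarity is 1), so their own history
    # is the first source of candidates.
    similar_users = hybrid_similarity[user_ids.index(user_id)].argsort()[::-1]
    recommended_workouts = []
    for similar_user_idx in similar_users:
        similar_user_id = user_ids[similar_user_idx]
        user_workouts = df_extended[df_extended['User ID'] == similar_user_id]['Workout Type'].tolist()
        recommended_workouts.extend(user_workouts)
        if len(set(recommended_workouts)) >= num_recommendations:
            break
    # dict.fromkeys removes duplicates while keeping neighbor order;
    # slicing a plain set would return an arbitrary subset
    return list(dict.fromkeys(recommended_workouts))[:num_recommendations]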

# Test recommendations
print(f"Recommended Workouts for user_10: {recommend_workouts('user_10')}")


Recommended Workouts for user_10: ['Zumba', 'Strength', 'HIIT']
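
In `recommend_workouts`, a user's similarity to themselves is normally the largest value in their row, so the neighbor loop starts with the user's own history. One possible variant, sketched below on invented toy data, skips the query user and any workout they have already logged. The function name `recommend_from_neighbors` and its arguments are hypothetical and not part of the notebook.

```python
import numpy as np

def recommend_from_neighbors(user_idx, similarity, histories, num_recommendations=3):
    """Collect unseen workouts from the most similar other users.

    histories[i] lists the workouts of user i, in the same row order as similarity.
    """
    seen = set(histories[user_idx])
    order = np.argsort(similarity[user_idx])[::-1]   # most similar first
    recommendations = []
    for neighbor_idx in order:
        if neighbor_idx == user_idx:
            continue                                  # skip the query user
        for workout in histories[neighbor_idx]:
            if workout not in seen and workout not in recommendations:
                recommendations.append(workout)
                if len(recommendations) == num_recommendations:
                    return recommendations
    return recommendations

# Toy similarity matrix and workout histories for three users (hypothetical)
similarity = np.array([[1.0, 0.8, 0.3],
                       [0.8, 1.0, 0.5],
                       [0.3, 0.5, 1.0]])
histories = [['Yoga', 'HIIT'],
             ['Yoga', 'Cycling', 'Dance'],
             ['Zumba', 'HIIT', 'Pilates']]

print(recommend_from_neighbors(0, similarity, histories))
# prints ['Cycling', 'Dance', 'Zumba']
```

With this variant, evaluation would need held-out workouts as ground truth, since workouts the user has already logged can no longer be recommended.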


**Model Evaluation**

In [16]:
# Generate ground truth
ground_truth = {user_id: random.sample(df_extended[df_extended['User ID'] == user_id]['Workout Type'].tolist(), 2)
                for user_id in user_ids if len(df_extended[df_extended['User ID'] == user_id]['Workout Type']) >= 2}

# Evaluate recommendations
precision_scores = []
recall_scores = []

for user_id, actual_workouts in ground_truth.items():
    recommended_workouts = recommend_workouts(user_id, num_recommendations=3)
    true_positives = len(set(recommended_workouts) & set(actual_workouts))
    precision = true_positives / len(recommended_workouts) if recommended_workouts else 0
    recall = true_positives / len(actual_workouts) if actual_workouts else 0

    precision_scores.append(precision)
    recall_scores.append(recall)

# Average metrics
average_precision = sum(precision_scores) / len(precision_scores)
average_recall = sum(recall_scores) / len(recall_scores)

print(f"Average Precision: {average_precision:.2f}")
print(f"Average Recall: {average_recall:.2f}")


Average Precision: 0.51
Average Recall: 0.76


## Results
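RMSE and MAE measure prediction accuracy, while precision and recall measure recommendation relevance. Across 5-fold cross-validation, the SVD model reached a mean RMSE of about 0.31 and a mean MAE of about 0.26 on the 0 to 1 rating scale. In the top-3 ranking evaluation, the SVD baseline scored an average precision of 0.20 and recall of 0.30, and the hybrid model scored 0.51 and 0.76.

**Comparison and Limitations:** The hybrid model scored higher than the baseline collaborative filter on both precision and recall. Two caveats apply. First, the ground truth in both evaluations is sampled from workouts each user has already logged, and the hybrid recommender draws first from the most similar user, which is normally the user themselves, so part of the gain reflects recovering known history rather than suggesting new workouts. Second, the synthetic data is randomly generated and lacks the structure and noise of real usage, which limits generalizability. Future steps could involve testing on real-world datasets to validate the scalability and robustness of the hybrid approach.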