# Recommender
A machine-learning-based tool that helps users discover items they might like, based on their data and preferences. Two familiar examples:

1. **Netflix:** suggesting what movie to watch next.
2. **Spotify:** curating music playlists for your vibe.

## Why it matters
It's a beautiful example of how **data meets personalization**. We train models to understand behaviour patterns, make predictions, and improve with feedback, all of which are core aspects of ML.

## Two Main Approaches:
| Recommender Type         | Description                                                         | Example Use Case                                  |
|--------------------------|---------------------------------------------------------------------|---------------------------------------------------|
| Content-Based Filtering  | Recommends items based on user’s previous choices and item features | Suggesting articles with similar topics           |
| Collaborative Filtering  | Recommends items based on similar users' preferences                | “Users who liked this also liked…” on e-commerce  |

### Content-Based Filtering
It's a recommendation technique where we look at the **features of the items** a user has liked and recommend **similar items** based on those features.  

* Think of it as: ***if you liked item A with feature X, you'll probably like item B, which also has feature X.***
* It doesn't require other users' data, just the user's own interaction history.

#### Example
| Movie Title       | Genre         | Lead Actor       | Runtime (min) | IMDb Rating |
|-------------------|---------------|------------------|----------------|--------------|
| Inception         | Sci-Fi        | Leonardo DiCaprio| 148            | 8.8          |
| Interstellar      | Sci-Fi        | Matthew McConaughey| 169          | 8.6          |
| The Notebook      | Romance       | Ryan Gosling     | 123            | 7.8          |
| Shutter Island    | Thriller      | Leonardo DiCaprio| 138            | 8.1          |

1. **Each row** represents an **item** to recommend!
2. The features or **columns** are item characteristics!

> **Example question** to ask of the dataset: What other movie would you recommend to someone who likes Inception?

#### Example Scenario
Say you've watched three Sci-Fi movies rated very highly. Your user profile becomes:  
1. Genre: Sci-Fi heavy.  
2. Preference for high IMDb ratings.
3. Maybe specific directors or themes (if features allow).  

Now, when recommending a new movie, the system finds items with **similar features**, such as **Genre** and **IMDb Rating**, and ranks them by **similarity** to your profile.  
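
The scenario above can be sketched in a few lines. This is only an illustration, not the method used later in the notebook: the feature encoding (one-hot genre plus the IMDb rating scaled to 0-1), the helper names `cosine` and `rank_unseen`, and the choice of liked movie are all assumptions, and the lead actor is left out for brevity.

```python
import numpy as np

# Assumed encoding: [is_sci_fi, is_romance, is_thriller, imdb_rating / 10]
movies = {
    "Inception":      np.array([1, 0, 0, 0.88]),
    "Interstellar":   np.array([1, 0, 0, 0.86]),
    "The Notebook":   np.array([0, 1, 0, 0.78]),
    "Shutter Island": np.array([0, 0, 1, 0.81]),
}

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_unseen(liked, catalog):
    """Average the liked items into a user profile, then rank the rest."""
    profile = np.mean([catalog[title] for title in liked], axis=0)
    scores = {
        title: cosine(profile, vector)
        for title, vector in catalog.items()
        if title not in liked
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Someone who liked Inception: which of the remaining movies is closest?
ranking = rank_unseen(["Inception"], movies)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

With this encoding the other Sci-Fi title comes out on top, because it shares the genre flag and has a nearly identical rating.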

### Similarity Metrics
Similarity metrics are **mathematical tools** used to quantify how alike two objects are. In content-based filtering, they help measure the **closeness between a user's profile** and an **item's feature vector**, so we can rank and recommend the most relevant items.

### Two popular similarity metrics
| Metric             | Definition                                                            | Pros                                                   | Cons                                                  | When to Use                                         |
|--------------------|----------------------------------------------------------------------|--------------------------------------------------------|-------------------------------------------------------|----------------------------------------------------|
| **Cosine Similarity**   | Measures angle between two vectors (focuses on direction, not magnitude) | Works well with high-dimensional, sparse data (like text) | Doesn't account for magnitude differences             | Text-based data, user/item profiles in vector form |
| **Euclidean Distance** | Measures straight-line distance between two vectors (includes magnitude) | Easy to interpret, considers magnitude                | Can be skewed by large scale differences              | When magnitude matters (e.g., numerical ratings)   |

#### Cosine Similarity Metric
Cosine similarity calculates the **angle between two vectors** in a multi-dimensional space.  
It is especially helpful when you're dealing with **text, user preferences, or item features**.  

1. It focuses on **direction** instead of magnitude.
2. It can handle **high dimensions**. 
3. It works well on **sparse data** (data containing many 0 values).

**Formula**: $$\text{Cosine Similarity} = \frac{A \cdot B}{\|A\| \|B\|}$$ 
Where:   
1. $A \cdot B$ is the dot product of vectors $A$ and $B$.  
2. $\|A\|$ and $\|B\|$ are their magnitudes (lengths).

> The smaller the angle, the more similar the vectors are. 
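
The formula translates almost directly into NumPy. The sketch below is a hand-rolled version for illustration (the function name `cosine_sim` and the toy vectors are assumptions); it shows that two vectors pointing in the same direction score 1 regardless of their length, while perpendicular vectors score 0.

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same direction, different magnitude: the angle between them is 0
same_direction = cosine_sim([1, 2, 3], [2, 4, 6])

# Perpendicular vectors: the angle between them is 90 degrees
orthogonal = cosine_sim([1, 0], [0, 1])

print(same_direction, orthogonal)
```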

#### Euclidean Distance Metric
Euclidean distance is the straight-line distance between two points (vectors) in a multidimensional space.  
It reflects the actual geometric distance just like you'd use a ruler in real life.  

If we have two vectors $A = [a_1, a_2, \dots, a_n]$ and $B = [b_1, b_2, \dots, b_n]$:  
**Formula**: $$\text{Euclidean Distance} = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2 + \dots + (a_n - b_n)^2}$$  

> Smaller distance = more similar.
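
As a quick check of the formula, here is a minimal NumPy sketch (the function name `euclidean_distance` and the toy vectors are assumptions). The two vectors point in the same direction, so cosine similarity would treat them as identical, yet Euclidean distance still reports a gap because it takes magnitude into account.

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the summed squared differences between a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Differences are (1, 2, 2), so the distance is sqrt(1 + 4 + 4) = 3
distance = euclidean_distance([1, 2, 2], [2, 4, 4])
print(distance)
```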





## Content-Based Filtering Recommendation System
Building a fruit recommendation system based on **carbohydrates**, **protein**, and **sugars**.

In [2]:
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
dataset = pd.read_csv("datasets/fruitsnutrition.csv")

In [4]:
dataset.head()

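|   | fruit | energy (kcal/kJ) | water (g) | protein (g) | total fat (g) | carbohydrates (g) | fiber (g) | sugars (g) | calcium (mg) | iron (mg) | ... | potassium (mg) | sodium (g) | vitamin A (IU) | vitamin C (mg) | vitamin B1 (mg) | vitamin B2 (mg) | viatmin B3 (mg) | vitamin B5 (mg) | vitamin B6 (mg) | vitamin E (mg) |
|---|-------|------------------|-----------|-------------|---------------|-------------------|-----------|------------|--------------|-----------|-----|----------------|------------|----------------|----------------|-----------------|-----------------|-----------------|-----------------|-----------------|----------------|
| 0 | Apple | 48/200 | 86.7 | 0.27 | 0.13 | 12.7 | 1.3 | 10.1 | 5 | 0.07 | ... | 90 | 0 | 38 | 4.0 | 0.019 | 0.028 | 0.091 | 0.071 | 0.037 | 0.05 |
| 1 | Apricot | 48/201 | 86.4 | 1.4 | 0.39 | 11.12 | 2.0 | 9.24 | 13 | 0.39 | ... | 259 | 1 | 1926 | 10.0 | 0.03 | 0.04 | 0.6 | 0.24 | 0.054 | 0.89 |
| 2 | Avocado | 160/670 | 73.23 | 2.0 | 14.7 | 8.53 | 6.7 | 0.66 | 12 | 0.55 | ... | 485 | 7 | 146 | 10.0 | 0.067 | 0.13 | 1.738 | 1.389 | 0.257 | 2.07 |
| 3 | Banana | 89/371 | 74.91 | 1.09 | 0.33 | 22.84 | 2.6 | 12.23 | 5 | 0.26 | ... | 358 | 1 | 64 | 8.7 | 0.031 | 0.073 | 0.665 | 0.334 | 0.367 | 0.1 |
| 4 | Blackberries | 43/181 | 88.15 | 1.39 | 0.49 | 9.61 | 5.3 | 4.88 | 29 | 0.62 | ... | 162 | 1 | 214 | 21.0 | 0.02 | 0.026 | 0.646 | 0.276 | 0.03 | 1.17 |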


In [15]:
dataset.columns

Index(['fruit', 'energy (kcal/kJ)', 'water (g)', 'protein (g)',
       'total fat (g)', 'carbohydrates (g)', 'fiber (g)', 'sugars (g)',
       'calcium (mg)', 'iron (mg)', 'magnessium (mg)', 'phosphorus (mg)',
       'potassium (mg)', 'sodium (g)', 'vitamin A (IU)', 'vitamin C (mg)',
       'vitamin B1 (mg)', 'vitamin B2 (mg)', 'viatmin B3 (mg)',
       'vitamin B5 (mg)', 'vitamin B6 (mg)', 'vitamin E (mg)'],
      dtype='object')

In [63]:
# Filter out rows with '-' in any of the selected columns (excluding 'fruit')
filtered_data = dataset[
    ~(dataset[['carbohydrates (g)', 'protein (g)', 'sugars (g)']] == '-').any(axis=1)
]

In [66]:
fruit_nutrition = filtered_data[['fruit', 'carbohydrates (g)', 'protein (g)', 'sugars (g)']]
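
If the `'-'` placeholder caused pandas to read these columns as strings, the surviving rows may still be stored as text after the filter above. One alternative, sketched here on a hypothetical three-row miniature of the table (the `Unknown fruit` row and the names `raw`, `numeric`, and `clean` are made up for illustration), is to convert with `pd.to_numeric(errors="coerce")` and drop the resulting `NaN` rows, which filters and converts in one pass.

```python
import pandas as pd

# Hypothetical miniature of the nutrition table; '-' marks a missing value
raw = pd.DataFrame({
    "fruit": ["Apple", "Unknown fruit", "Banana"],
    "carbohydrates (g)": ["12.7", "-", "22.84"],
    "protein (g)": ["0.27", "0.5", "1.09"],
    "sugars (g)": ["10.1", "3.0", "12.23"],
})
feature_cols = ["carbohydrates (g)", "protein (g)", "sugars (g)"]

# Anything that cannot be parsed as a number (such as '-') becomes NaN
numeric = raw[feature_cols].apply(pd.to_numeric, errors="coerce")

# Re-attach the names and drop rows with a missing feature
clean = pd.concat([raw[["fruit"]], numeric], axis=1).dropna(subset=feature_cols)
print(clean)
print(clean.dtypes)
```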

In [67]:
fruit_nutrition.head()

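|   | fruit | carbohydrates (g) | protein (g) | sugars (g) |
|---|-------|-------------------|-------------|------------|
| 0 | Apple | 12.7 | 0.27 | 10.1 |
| 1 | Apricot | 11.12 | 1.4 | 9.24 |
| 2 | Avocado | 8.53 | 2.0 | 0.66 |
| 3 | Banana | 22.84 | 1.09 | 12.23 |
| 4 | Blackberries | 9.61 | 1.39 | 4.88 |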


In [68]:
fruit_nutrition = fruit_nutrition.set_index('fruit')
fruit_nutrition.index.name = None

In [75]:
fruit_nutrition.head()

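|   | carbohydrates (g) | protein (g) | sugars (g) |
|---|-------------------|-------------|------------|
| Apple | 12.7 | 0.27 | 10.1 |
| Apricot | 11.12 | 1.4 | 9.24 |
| Avocado | 8.53 | 2.0 | 0.66 |
| Banana | 22.84 | 1.09 | 12.23 |
| Blackberries | 9.61 | 1.39 | 4.88 |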


#### Example Cosine-Similarity

In [70]:
fruit_nutrition.loc[['Apple', 'Cherry']]

|   | carbohydrates (g) | protein (g) | sugars (g) |
|---|-------------------|-------------|------------|
| Apple | 12.7 | 0.27 | 10.1 |
| Cherry | 16.01 | 1.06 | 12.82 |


In [71]:
cosine_similarity(fruit_nutrition.loc[['Apple', 'Cherry']])

array([[1.        , 0.99938207],
       [0.99938207, 1.        ]])
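
To see where the off-diagonal value comes from, the same number can be reproduced by hand with NumPy, using the Apple and Cherry rows shown above. This is only a sanity check of the formula, not a replacement for `cosine_similarity`.

```python
import numpy as np

# [carbohydrates (g), protein (g), sugars (g)] copied from the rows above
apple = np.array([12.7, 0.27, 10.1])
cherry = np.array([16.01, 1.06, 12.82])

# Dot product divided by the product of the two vector lengths
similarity = float(apple @ cherry / (np.linalg.norm(apple) * np.linalg.norm(cherry)))
print(similarity)
```

The result should agree with the off-diagonal entries of the matrix above. The diagonal entries are 1 because every fruit is perfectly similar to itself.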

#### Cosine-Similarity on the dataset

In [74]:
cosine_df = pd.DataFrame(cosine_similarity(fruit_nutrition), index=fruit_nutrition.index, columns=fruit_nutrition.index)
cosine_df.head(3)

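|   | Apple | Apricot | Avocado | Banana | Blackberries | Blueberry | Carambola | Cherimoya | Cherry | Clementine | ... | Passion | Peaches | Pear | Pineapple | Plum | Pomegranate | Raspberry | Strawberry | Tangerine | Watermelon |
|---|-------|---------|---------|--------|--------------|-----------|-----------|-----------|--------|------------|-----|---------|---------|------|-----------|------|-------------|-----------|------------|-----------|------------|
| Apple | 1.0 | 0.996581 | 0.810283 | 0.983495 | 0.973618 | 0.997254 | 0.983931 | 0.99755 | 0.999382 | 0.999028 | ... | 0.97255 | 0.997274 | 0.994287 | 0.99948 | 0.998623 | 0.997647 | 0.947316 | 0.992908 | 0.999522 | 0.998832 |
| Apricot | 0.996581 | 1.0 | 0.812952 | 0.978354 | 0.974955 | 0.994391 | 0.986876 | 0.997599 | 0.998829 | 0.998345 | ... | 0.969931 | 0.999297 | 0.988945 | 0.996766 | 0.998497 | 0.997786 | 0.943672 | 0.991822 | 0.998542 | 0.999399 |
| Avocado | 0.810283 | 0.812952 | 1.0 | 0.90011 | 0.921369 | 0.851443 | 0.896275 | 0.843686 | 0.815455 | 0.828642 | ... | 0.924016 | 0.792905 | 0.864738 | 0.828536 | 0.791885 | 0.842316 | 0.95377 | 0.873667 | 0.817238 | 0.810592 |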


#### Recommending fruits from the similarity matrix

In [None]:
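def recommend_fruits_based_on_carbs_pro_sugar(fruit_name, similarity_df):
    # Guard against names that are missing from the similarity matrix
    if fruit_name not in similarity_df.columns:
        return f"{fruit_name} is not found in the fruit list"

    # Rank every fruit by its similarity to fruit_name, most similar first
    # (the fruit itself always comes first with a score of 1.0)
    return similarity_df[[fruit_name]].sort_values(by=fruit_name, ascending=False)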

In [89]:
recommend_fruits_based_on_carbs_pro_sugar(fruit_name='Avocado', similarity_df=cosine_df).head()

|   | Avocado |
|---|---------|
| Avocado | 1.0 |
| Lime | 0.983309 |
| Raspberry | 0.95377 |
| Cranberries | 0.951712 |
| Passion | 0.924016 |
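
The first row of the result is always the query fruit itself. A possible variant, sketched below under assumed names (`top_similar` and its `n` parameter are not part of the notebook), drops that row and returns only the top `n` neighbours. The demo frame holds a few Avocado similarities copied from the outputs above so the snippet runs on its own.

```python
import pandas as pd

def top_similar(fruit_name, similarity_df, n=3):
    """Return the n most similar fruits, excluding the query fruit itself."""
    if fruit_name not in similarity_df.columns:
        raise KeyError(f"{fruit_name} is not found in the fruit list")
    scores = similarity_df[fruit_name].drop(labels=fruit_name)
    return scores.sort_values(ascending=False).head(n)

# A few Avocado similarities copied from the outputs above
demo_df = pd.DataFrame(
    {"Avocado": [0.810283, 1.0, 0.983309, 0.95377]},
    index=["Apple", "Avocado", "Lime", "Raspberry"],
)

top = top_similar("Avocado", demo_df, n=2)
print(top)
```

Raising an error instead of returning a message string is a design choice made here so that callers always get the same return type back.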


## Collaborative Filtering
It is a technique that makes recommendations based on **user behavior and preferences**, rather than item features. 
> If user A and user B both liked the same movies, and user A liked a new movie that user B hasn't seen yet, the system might recommend that movie to user B.

### Types of Collaborative Filtering

| Type                         | Description                                                                 | Strengths                                       |
|------------------------------|-----------------------------------------------------------------------------|------------------------------------------------|
| **User-Based Filtering**         | Finds users similar to the target user and recommends items they liked      | Personalized recommendations                    |
| **Item-Based Filtering**         | Finds items similar to those the user liked and recommends those            | Stable and efficient with large datasets        |
| **Model-Based Filtering**        | Uses machine learning models like Matrix Factorization, Neural Networks     | Handles sparsity and scales better              |
| **Hybrid Collaborative Filtering**| Combines collaborative and content-based methods                           | More accurate and reduces cold-start problems   |

### Example of a taste rating table (user-item matrix)

| Person   | Apple | Banana | Mango | Orange |
|----------|-------|--------|-------|--------|
| Alice    | 4.5   | 3.8    | 4.9   | 4.0    |
| Bob      | 3.2   | 4.6    | 4.0   | 3.5    |
| Clara    | 4.8   | 2.9    | 5.0   | 4.2    |
| David    | 3.9   | 4.4    | 3.7   | 4.1    |

1. Each row represents a **user**.
2. The features (columns) are **items** to recommend.

> **Example** question to ask of the dataset: What other fruit would you recommend to someone who likes mangos?
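
One way to approach that question is item-based: treat each fruit's column of ratings as a vector and compare it with the Mango column. The sketch below uses the taste table above with plain cosine similarity; the variable names are assumptions. Because all raw ratings are positive, the scores cluster close to 1, which is why ratings are often mean-centred first in practice.

```python
import numpy as np
import pandas as pd

# The taste rating table from above: rows are users, columns are fruits
ratings = pd.DataFrame(
    {
        "Apple":  [4.5, 3.2, 4.8, 3.9],
        "Banana": [3.8, 4.6, 2.9, 4.4],
        "Mango":  [4.9, 4.0, 5.0, 3.7],
        "Orange": [4.0, 3.5, 4.2, 4.1],
    },
    index=["Alice", "Bob", "Clara", "David"],
)

# One row per fruit, scaled to unit length so a dot product equals the cosine
item_vectors = ratings.T.to_numpy()
unit_vectors = item_vectors / np.linalg.norm(item_vectors, axis=1, keepdims=True)
item_similarity = pd.DataFrame(
    unit_vectors @ unit_vectors.T,
    index=ratings.columns,
    columns=ratings.columns,
)

# Fruits ranked by how similarly users rated them compared with Mango
mango_neighbours = item_similarity["Mango"].drop("Mango").sort_values(ascending=False)
print(mango_neighbours)
```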

### User-Item Matrix
A User-Item matrix is a 2D grid where:

1. **Rows** represent users.
2. **Columns** represent items (e.g., movies, products, songs).
3. **Values** represent interactions: typically ratings, but they can also be clicks, purchases, or binary likes.

It’s typically the first step in building a collaborative filtering system because:

1. It captures all known user-item interactions
2. It allows algorithms to identify patterns and similarities between users or items
3. It’s the basis for computing similarity scores (e.g., cosine similarity, Pearson correlation)
4. It’s used for predicting missing values (such as a user's rating for an item they have not tried yet)
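
To make the last point concrete, here is a rough user-based sketch on the taste table from earlier. It hides Alice's Mango rating and estimates it as a similarity-weighted average of the other users' Mango ratings. The function name `predict_rating` and the weighting scheme are assumptions chosen for illustration, and the code assumes the other users have rated every item the target user rated.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame(
    {
        "Apple":  [4.5, 3.2, 4.8, 3.9],
        "Banana": [3.8, 4.6, 2.9, 4.4],
        "Mango":  [4.9, 4.0, 5.0, 3.7],
        "Orange": [4.0, 3.5, 4.2, 4.1],
    },
    index=["Alice", "Bob", "Clara", "David"],
)
# Pretend Alice has not rated Mango yet
ratings.loc["Alice", "Mango"] = np.nan

def predict_rating(matrix, user, item):
    """Similarity-weighted average of other users' ratings for item."""
    rated = matrix.columns[matrix.loc[user].notna()]  # items the user has rated
    target = matrix.loc[user, rated].to_numpy(dtype=float)
    weighted_sum, weight_total = 0.0, 0.0
    for _, row in matrix.drop(index=user).iterrows():
        if pd.isna(row[item]):
            continue  # this user cannot help predict the item
        other = row[rated].to_numpy(dtype=float)
        sim = target @ other / (np.linalg.norm(target) * np.linalg.norm(other))
        weighted_sum += sim * row[item]
        weight_total += abs(sim)
    return float(weighted_sum / weight_total)

predicted = predict_rating(ratings, "Alice", "Mango")
print(round(predicted, 2))
```

The estimate always lands between the lowest and highest Mango rating given by the other users, weighted towards the users whose tastes are closest to Alice's.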

> Often we need to **pivot** our data in Python to structure it as a user-item matrix.
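
As a minimal illustration of that pivot, the sketch below reshapes a tiny long-format table (one row per user-fruit rating) into a user-item matrix with `DataFrame.pivot`. The three rows are a hypothetical subset; any user-item pair that is absent from the long table shows up as `NaN` in the matrix.

```python
import pandas as pd

# Long format: one row per (user, fruit) rating; user1 has no Lime rating here
long_ratings = pd.DataFrame({
    "User": ["user0", "user0", "user1"],
    "Fruit": ["Lemon", "Lime", "Lemon"],
    "Rating": [4.3, 4.7, 4.6],
})

# Wide format: users as rows, fruits as columns, ratings as values
matrix = long_ratings.pivot(index="User", columns="Fruit", values="Rating")
print(matrix)
```

`pivot` raises an error if the same user-item pair appears more than once; `pivot_table` with an aggregation function is the usual fallback in that case.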

## Collaborative Filtering Recommendation System

In [3]:
fruit_ratings = pd.read_csv("datasets/fruit_ratings.csv")

In [4]:
fruit_ratings.head()

|   | User | Fruit | Rating |
|---|------|-------|--------|
| 0 | user0 | Lemon | 4.3 |
| 1 | user1 | Lemon | 4.6 |
| 2 | user2 | Lemon | 4.2 |
| 3 | user3 | Lemon | 4.4 |
| 4 | user4 | Lemon | 4.1 |


**Restructuring** the data into a user-item matrix.

In [6]:
X = fruit_ratings.pivot(index='User', columns='Fruit', values='Rating')

In [9]:
X.head()

| User | Banana | Lemon | Lime | Mango | Peach | Pineapple |
|------|--------|-------|------|-------|-------|-----------|
| user0 | 1.1 | 4.3 | 4.7 | 1.0 | 2.4 | 4.9 |
| user1 | 1.2 | 4.6 | 4.5 | 1.3 | 2.1 | 4.8 |
| user10 | 4.8 | 3.7 | 4.1 | 2.2 | 3.3 | 2.4 |
| user11 | 2.5 | 2.9 | 3.7 | 4.1 | 1.8 | 4.1 |
| user12 | 4.2 | 4.1 | 2.3 | 3.2 | 5.0 | 1.7 |


In [8]:
X.isna().sum()

Fruit
Banana       0
Lemon        0
Lime         1
Mango        0
Peach        0
Pineapple    0
dtype: int64
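
The count above shows one missing Lime rating. Similarity functions such as `cosine_similarity` generally cannot work with `NaN` values, so the gap has to be handled before any similarities are computed. The notebook does not show which approach it takes, so the following is only one possible option: filling the gap with the column mean. The three-row frame is a hypothetical stand-in for `X`, and the position of the missing value is placed arbitrarily.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X with one missing Lime rating
X_demo = pd.DataFrame(
    {"Lemon": [4.3, 4.6, 3.7], "Lime": [4.7, np.nan, 4.1]},
    index=["user0", "user1", "user10"],
)

# Replace each missing rating with the mean rating of that fruit
X_filled = X_demo.fillna(X_demo.mean())
print(X_filled)
print(X_filled.isna().sum())
```

Mean imputation is simple but treats the missing user as average for that fruit. Dropping the affected user, or predicting the value from similar users, are common alternatives.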