# Adagio Recommendation System

## Introduction

One way to keep a consumer on a page longer is to show products that may interest them. To do so, one can build a **recommendation system**.

There are different models. **A simple one is to recommend other products from the same category**. For example, if someone is looking for a TV on an electronics website, the recommendations can be other TVs (not refrigerators, coffee makers, and so on).

A more sophisticated approach is to **recommend the most similar products**. For example, if the same person is looking for a low-consumption TV on the same website, the recommendations can be other low-consumption electronics (not high-consumption TVs).

The latter raises a problem: **we first need to define a metric to measure similarity**. That is, we assign an ordered value (a number) to each pair of products, where a larger value means greater similarity.

In this notebook we derive similarities from data collected from the [Adagio website](https://adagio.cl/) and build an algorithm capable of recommending tea-related products.

---

Import the basic necessary libraries.

In [1]:
import numpy as np
np.random.seed(151515)
import pandas as pd

### Import Dataset

Import the dataset.

In [2]:
df = pd.read_csv("data_scraper/products.csv")

The data looks like:

In [3]:
df.head(5)

Unnamed: 0,amount,benefit,description_1,description_2,format,link,name,price,property_1,property_2,property_3,property_4,property_5,raters,rating,temperature,time
0,"2,5g/250ml",hierba,"exquisita mezcla de naranja, hibisco, rosa mos...","naranja, hibisco, rosa mosqueta y sabor natura...",60 gramos,https://adagio.cl/products/ruby-orange,ruby orange,Agotado,antioxidante,sin-cafeina,,,,1.0,5.0,100 celsius,5 min
1,"2,5g/250ml",negro,"te negro descafeinado de ceylan, trozos de coc...","te negro descafeinado de ceylan, trozos de coc...",57 gramos,https://adagio.cl/products/ghost-drink,ghost drink,5990,energizante,sin-cafeina,,,,1.0,5.0,100 celsius,3 min
2,,,almacena el te de forma conveniente y segura e...,elige tu te favorito y guardalo en tu tarro me...,,https://adagio.cl/products/tarro-grande-negro,tarro grande negro,2490,,,,,,2.0,3.5,,
3,,,infusor de goma con rejilla de acero inoxidabl...,echar la cantidad recomendada de te en el infu...,grafito,https://adagio.cl/products/infusor-de-goma-gra...,infusor de goma grafito,4990,,,,,,1.0,5.0,,
4,"2,5g/250ml",negro,"te negro confeti verde, ojos de caramelo y sab...","te negro confeti verde, ojos de caramelo y sab...",57 gramos,https://adagio.cl/products/monster-mash,monster mash,5990,energizante,,,,,1.0,3.0,100 celsius,3 min


The dataset contains the following information.

In [4]:
df.info()
print(f"\nDataFrame Shape: {df.shape}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 642 entries, 0 to 641
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   amount         386 non-null    object 
 1   benefit        384 non-null    object 
 2   description_1  636 non-null    object 
 3   description_2  638 non-null    object 
 4   format         555 non-null    object 
 5   link           642 non-null    object 
 6   name           642 non-null    object 
 7   price          642 non-null    object 
 8   property_1     408 non-null    object 
 9   property_2     256 non-null    object 
 10  property_3     56 non-null     object 
 11  property_4     5 non-null      object 
 12  property_5     0 non-null      float64
 13  raters         455 non-null    float64
 14  rating         455 non-null    float64
 15  temperature    386 non-null    object 
 16  time           386 non-null    object 
dtypes: float64(3), object(14)
memory usage: 85.4+ KB

Drop the `property_5` column, as it has 0 non-null rows.

In [5]:
df.drop(columns="property_5", inplace=True)

### Remove Duplicates

Duplicates can affect the model: ideally, the same product should not be recommended twice.

In [6]:
duplicates_number = sum(df["link"].duplicated())
print(f"Number of duplicated links: {duplicates_number}")

Number of duplicated links: 379


Drop all the duplicated links.

In [7]:
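# Keep only the first occurrence of each link.
df.drop_duplicates(subset="link", inplace=True)
# Shuffle the rows reproducibly, then rebuild a clean 0..n-1 index.
df = df.sample(frac=1, random_state=5454)
df.reset_index(drop=True, inplace=True)
df = df.copy()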

In [8]:
print(f"Final DataFrame shape: {df.shape}")

Final DataFrame shape: (263, 16)


Finally, we have the following information.

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 263 entries, 0 to 262
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   amount         154 non-null    object 
 1   benefit        153 non-null    object 
 2   description_1  261 non-null    object 
 3   description_2  260 non-null    object 
 4   format         218 non-null    object 
 5   link           263 non-null    object 
 6   name           263 non-null    object 
 7   price          263 non-null    object 
 8   property_1     162 non-null    object 
 9   property_2     91 non-null     object 
 10  property_3     21 non-null     object 
 11  property_4     2 non-null      object 
 12  raters         182 non-null    float64
 13  rating         182 non-null    float64
 14  temperature    154 non-null    object 
 15  time           154 non-null    object 
dtypes: float64(2), object(14)
memory usage: 33.0+ KB


## Similarities

### Categorical Variables

Some products have labels like *te negro*, *energizante* and so on. **The algorithm should recommend products with similar labels**.

We have the following unique values:

In [10]:
print(f"Unique Values in format column:\n {df.format.unique()}\n")
print(f"Unique Values in benefit column:\n {df.benefit.unique()}\n")
print(f"Unique Values in property_1 column:\n {df.property_1.unique()}\n")
print(f"Unique Values in property_2 column:\n {df.property_2.unique()}\n")
print(f"Unique Values in property_3 column:\n {df.property_3.unique()}\n")
print(f"Unique Values in property_4 column:\n {df.property_4.unique()}\n")

Unique Values in format column:
 [nan '57 gramos' '43 gramos' '10 teabags' '85 gramos' '70 gramos'
 'turquesa infusor dorado' 'roja' '35 gramos' 'turquesa' 'iridiscent'
 'rojo' 'dark blue' '60 gramos' 'turquesa granito' '27 gramos' 'negro'
 'blanco' '85 gramos / 11 bolitas' 'dorada' '40 gramos' 'celeste' 'lila'
 'azul' '15 teabags' 'lila granito' 'grafito' 'hojas naranjo' 'fucsia'
 'grande' 'iridescent lila' '28 gramos' 'verde' 'pack good moments'
 '15 piramides' 'moonstar turquesa' 'full color menta' 'full color lila'
 'amarillo' 'menta' '50 gramos' 'estrella salmon' 'estrella turquesa'
 'estrella fucsia' 'nubes turquesa' 'transparente' '30 teabags' 'negra'
 'infusor grafito' 'black']

Unique Values in benefit column:
 [nan 'negro' 'hierba' 'rooibos' 'verde' 'oolong' 'matcha' 'rojo' 'blanco'
 'amarillo']

Unique Values in property_1 column:
 [nan 'antioxidante' 'energizante' 'relajante' 'digestion' 'sin-cafeina']

Unique Values in property_2 column:
 [nan 'energizante' 'sin-cafeina' '

Fill the missing values with empty strings.

In [11]:
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions may not propagate to df.
label_columns = ["format", "benefit", "property_1", "property_2", "property_3", "property_4"]
df[label_columns] = df[label_columns].fillna("")

Clean the text so that it contains no special characters.

In [12]:
def clean_text(string):
    for character in string:
        if character not in "abcdefghijklmnopqrstuvwxyz ":
            string = string.replace(str(character), " ")
    return string

def clean_cafeina(string):
    for character in string:
        if character == "-":
            string = string.replace(str(character), "")
    return string

df["format"] = df["format"].apply(lambda x: clean_text(x))
df["properties"] = df["property_1"] + " " + df["property_2"] + " " \
    + df["property_3"] + " " + df["property_4"]
df["properties"] = df["properties"].apply(lambda x: x.strip())
df["properties"] = df["properties"].apply(lambda x: clean_cafeina(x))

Put all properties together in the `all_properties` column.

In [13]:
df["all_properties"] = df["format"] + " " + df["benefit"] + " " + df["properties"]
df["all_properties"] = df["all_properties"].apply(lambda x: x.strip())
df["all_properties"] = df["all_properties"].apply(lambda x: " ".join(x.split()))
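The character-by-character loop in `clean_text` can also be expressed as a single regular-expression substitution. The sketch below is an alternative formulation rather than the notebook's code; the name `clean_text_re` is hypothetical.

```python
import re

# Matches every character that is not a lowercase ASCII letter or a space.
_NOT_ALLOWED = re.compile(r"[^a-z ]")

def clean_text_re(string):
    """Hypothetical regex variant: replace each disallowed character with a space."""
    return _NOT_ALLOWED.sub(" ", string)

print(clean_text_re("85 gramos / 11 bolitas"))
print(clean_text_re("sin-cafeina"))
```

Digits and punctuation become spaces, which the later `" ".join(x.split())` step collapses.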

#### Vectorization

To measure how much one product looks like another, **we need to represent each item as a vector in some space**. This process is known as *vectorization*.

The column `all_properties` consists of sets of label words. One can count **how many times each label appears in each product**. This can be done using the `CountVectorizer` class from scikit-learn.

In [14]:
from sklearn.feature_extraction.text import CountVectorizer

In [15]:
count = CountVectorizer()
count_matrix = count.fit_transform(df["all_properties"])

The resulting `count_matrix` has shape:

In [16]:
count_matrix.shape

(263, 49)

Each row of the `all_properties` column is represented as a row of the $263\times 49$ matrix, and the $49$ columns correspond to the $49$ unique label words.

For example, the $2^{nd}$ row of `all_properties` says:

In [17]:
print(df["all_properties"][1])

gramos negro antioxidante energizante


And the second row of `count_matrix` is:

In [18]:
count_matrix.toarray()[1]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0])

We see that the $1$s correspond to the labels that appear in the $2^{nd}$ row, while the $0$s correspond to the labels that do not.

In total, we have $263$ vectors in a $49$-dimensional vector space.
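The counting idea can be sketched in pure Python on a few toy label strings. This is only an illustration of the principle: the documents are made up, tokens are split on whitespace, and `CountVectorizer` applies its own tokenization rules (for example, by default it ignores single-character tokens).

```python
from collections import Counter

# Toy label strings (hypothetical, in the style of `all_properties`).
docs = [
    "gramos negro antioxidante energizante",
    "gramos verde antioxidante",
    "teabags negro",
]

# Vocabulary: every distinct word, sorted alphabetically.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    """Return one count per vocabulary word for the given document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

count_rows = [count_vector(doc) for doc in docs]

print(vocabulary)
for row in count_rows:
    print(row)
```

Each document becomes a vector with one coordinate per vocabulary word, exactly as each product becomes a row of `count_matrix`.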

#### Similarity

Now that we have the vectors, **we must find a way to measure how close they are**. One way uses the *[dot product](https://en.wikipedia.org/wiki/Dot_product#Geometric_definition)* of two vectors $\vec{v}$ and $\vec{w}$:

$$\vec{v}\cdot\vec{w}=\|\vec{v}\|\|\vec{w}\|\cos \theta,$$

where $\theta$ is the angle between $\vec{v}$ and $\vec{w}$, and $\|\cdot\|$ is the Euclidean norm.

So we compute the cosine of the angle:
$$
\cos\theta = \frac{\vec{v}\cdot\vec{w}}{\|\vec{v}\|\|\vec{w}\|}
$$
to measure how close two vectors are, as **higher cosines (smaller angles) imply closer vectors**.

For example, $\cos\theta_{1} = \frac{\sqrt{2}}{2}$ is higher than $\cos\theta_{2} =-\frac{\sqrt{2}}{2}$, which implies that $\vec{v}$ is closer to $\vec{w}_{1}$ than to $\vec{w}_{2}$.


<img src="images/angle1.jpeg" width=300/> <img src="images/angle2.jpeg" width=445/>

This measure is known as *cosine similarity*.

In [19]:
from sklearn.metrics.pairwise import cosine_similarity

Create the cosine similarity matrix.

In [20]:
cosim_categorical = cosine_similarity(count_matrix, count_matrix)

It has the following shape:

In [21]:
cosim_categorical.shape

(263, 263)

**Note: the vectors only have non-negative entries, which implies that the cosine similarity lies in $[0, 1]$.**
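The example with $\cos\theta_{1}$ and $\cos\theta_{2}$ earlier in this section can be reproduced by hand. The sketch below uses made-up two-dimensional vectors and a hypothetical helper `cosine`; it assumes non-zero vectors and is not a replacement for scikit-learn's `cosine_similarity`.

```python
import math

def cosine(v, w):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

v = [1, 0]
w1 = [1, 1]    # 45 degrees away from v
w2 = [-1, 1]   # 135 degrees away from v

print(cosine(v, w1))
print(cosine(v, w2))
```

The first value is positive and the second negative, so `w1` is the closer vector. A negative cosine needs a negative coordinate, which count vectors never have.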

### Description

Almost every product on the website has a description that briefly presents the product. Each product also has a second description specifying how to use it or listing its ingredients.

All this information was scraped and, just like before, **we want to extract vectors from it**.

First, check the number of missing rows.

In [22]:
print(f"Number of missing rows in description_1: {sum(df.description_1.isna())}")
print(f"Number of missing rows in description_2: {sum(df.description_2.isna())}")

Number of missing rows in description_1: 2
Number of missing rows in description_2: 3


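Replace the missing values with an empty string.

In [23]:
# Assign the result back rather than using fillna(inplace=True) on a column selection.
description_columns = ["description_1", "description_2"]
df[description_columns] = df[description_columns].fillna("")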

To facilitate the vectorization process, remove all special characters.

The cleaning function is the same as before, repeated here for convenience.

In [24]:
def clean_text(string):
    for character in string:
        if character not in "abcdefghijklmnopqrstuvwxyz ":
            string = string.replace(str(character), " ")
    return string

Join both columns into a single one.

In [25]:
df["description_cleaned"] = df["description_1"].apply(lambda x: clean_text(x)) + \
    " " + df["description_2"].apply(lambda x: clean_text(x))

#### Vectorization

One might be tempted to reuse the vectorizer from the previous part. But there, each word represented a label. If we counted each word of a description as a label, the most frequent ones would be the *stop words* (function words that connect ideas, like "el", "la", "para", "por").

One way to solve this is to replace each sentence with keywords that represent its main ideas. But **a simpler way is to use the `TfidfVectorizer`**.

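In addition to counting how many times a word appears in a text, **we multiply the count by the "importance" of each word**. For a word $t$ and a document $d$, the final formula is:

$$
\text{Tfidf}\hspace{0.05in}(t, d) = \text{Tf}\hspace{0.05in}(t, d)\hspace{0.05in} \cdot \text{Idf}\hspace{0.05in}(t), \qquad \text{Idf}\hspace{0.05in}(t) = \frac{N}{nf(t)}
$$

Where:
- $\text{Tf}\hspace{0.05in}(t, d)$ is the frequency of the word $t$ in the document $d$,
- $\text{Idf}\hspace{0.05in}(t)$ is the importance of the word $t$, which grows as the word appears in fewer documents,
- $N$ is the total number of documents, and
- $nf(t)$ is the number of documents in which the word $t$ appears.

In practice a logarithm is applied to this ratio; by default, scikit-learn's `TfidfVectorizer` uses $\ln\frac{1+N}{1+nf(t)}+1$ and then normalizes each document vector.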
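As a worked illustration of the weighting, the sketch below computes it for three toy documents. It assumes the plain logarithmic form $\ln(N/nf(t))$ for the importance term, which is an assumption made for simplicity; the documents and the helper names `idf` and `tfidf` are hypothetical, and the numbers will differ from those of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Toy documents (hypothetical).
docs = [
    "te negro con canela",
    "te verde con menta",
    "infusor de acero",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def idf(word):
    """Importance of a word: assumed logarithmic form ln(N / document frequency)."""
    return math.log(n_docs / doc_freq[word])

def tfidf(word, tokens):
    """Raw count of the word in the document times its importance."""
    return tokens.count(word) * idf(word)

print(tfidf("te", tokenized[0]))      # "te" appears in 2 of the 3 documents
print(tfidf("canela", tokenized[0]))  # "canela" appears in 1 of the 3 documents
```

Both words occur once in the first document, yet the rarer word gets the larger weight.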

Import the vectorizer and a short list of Spanish stop words, which are excluded from the counts.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer
from utils import stop_words

In [27]:
tfidf = TfidfVectorizer(stop_words=stop_words)
tfidf_matrix = tfidf.fit_transform(df["description_cleaned"])

The `tfidf_matrix` has the following shape:

In [28]:
tfidf_matrix.shape

(263, 2107)

There are $2107$ unique words across all descriptions, stop words excluded.

#### Similarity

For the similarity, we are using the same idea as before.

In [29]:
from sklearn.metrics.pairwise import cosine_similarity
cosim_description = cosine_similarity(tfidf_matrix, tfidf_matrix)

With the shape:

In [30]:
cosim_description.shape

(263, 263)

**Note: the vectors only have non-negative entries, which implies that the cosine similarity lies in $[0, 1]$.**

### Rating

In a recommendation system, **one would like to show products that consumers like**. To do so, we use the `rating` and `raters` columns, which hold the average rating and the number of votes of each product.

Replace the missing values with $0$.

In [31]:
# Assign the result back rather than using fillna(inplace=True) on a column selection.
rating_columns = ["rating", "raters"]
df[rating_columns] = df[rating_columns].fillna(0.)

We use the following formula to assign a weight to each product $p$, based on its rating:
$$
\text{Weighted\_Rating}\hspace{0.05in}(p) = \left(r_p\cdot\frac{n_p}{n_p+1}  \right) + \left(\frac{\mu}{n_p+1}  \right)
$$

where
- $r_p$ is the rating of the product $p$,
- $n_p$ is the number of voters of the product $p$, and
- $\mu$ is the average rating of all the products.

With few votes the result stays close to $\mu$; as $n_p$ grows it approaches $r_p$.

In [32]:
def weighted_rating(rating, num_votes, mean_rate):
    return (rating*num_votes/(num_votes+1)) + (mean_rate/(num_votes+1))

In [33]:
mean_rate = np.mean(df["rating"])
weighted_rates = [weighted_rating(df["rating"][i], df["raters"][i], mean_rate) for i in df.index]
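A few hand-picked inputs show how the formula behaves. The function is repeated from the cell above so the snippet runs on its own; the mean rating of `3.0` and the three products are hypothetical values chosen for illustration.

```python
def weighted_rating(rating, num_votes, mean_rate):
    return (rating * num_votes / (num_votes + 1)) + (mean_rate / (num_votes + 1))

example_mean = 3.0  # hypothetical average rating over all products

no_votes = weighted_rating(0.0, 0, example_mean)      # unrated product
one_vote = weighted_rating(5.0, 1, example_mean)      # a single 5-star vote
many_votes = weighted_rating(5.0, 100, example_mean)  # one hundred 5-star votes

print(no_votes, one_vote, round(many_votes, 3))
```

An unrated product receives exactly the mean, a single vote moves the score halfway towards the product's own rating, and many votes bring it close to that rating.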

Define the rating matrix, scaled to the range $[0, 1]$, where $0$ corresponds to the lowest weighted rating and $1$ to the highest. Every row repeats the same vector, so column $j$ holds the score of product $j$.

In [34]:
weighted_matrix = np.tile(weighted_rates, (cosim_categorical.shape[1], 1))
weighted_matrix = np.interp(
    weighted_matrix, (weighted_matrix.min(), weighted_matrix.max()), (0, 1)
)

The matrix has the following shape.

In [35]:
weighted_matrix.shape

(263, 263)
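The tiling and rescaling steps can be checked on a tiny example. The three weighted ratings below are hypothetical; the calls are the same `np.tile` and `np.interp` used in the cell above.

```python
import numpy as np

# Hypothetical weighted ratings of three products.
weighted = np.array([3.0, 4.0, 5.0])

# Repeat the vector as rows: entry [i, j] is the rating of product j for every i.
tiled = np.tile(weighted, (3, 1))

# Linearly map the interval [min, max] onto [0, 1].
scaled = np.interp(tiled, (tiled.min(), tiled.max()), (0, 1))

print(scaled)
```

Because all rows are identical, adding this matrix to a similarity matrix raises the score of a well-rated product regardless of which product is being viewed.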

### Price Variable

Ideally, the algorithm should not recommend out-of-stock products. So we define the `available_matrix` to be $0$ if the product is available and $-100$ if not. Since every other matrix takes values in $[0, 1]$, this penalty prevents the algorithm from picking unavailable products.

In [36]:
available = [-100 if row == "Agotado" else 0 for row in df["price"] ]
available_matrix = np.tile(available, (cosim_categorical.shape[1], 1))

In [37]:
available_matrix

array([[   0, -100,    0, ..., -100,    0,    0],
       [   0, -100,    0, ..., -100,    0,    0],
       [   0, -100,    0, ..., -100,    0,    0],
       ...,
       [   0, -100,    0, ..., -100,    0,    0],
       [   0, -100,    0, ..., -100,    0,    0],
       [   0, -100,    0, ..., -100,    0,    0]])

## Recommendations

The recommendation system is a function to which we input:
- link: the link of the product.
- similarity: the similarity matrix. A formula derived from all the previous similarity matrices.

and it outputs:
- 3 recommendations: those with the largest similarity.
- 1 random recommendation: there are so many random events in life that we cannot quantify, so every recommendation system may have room for a random recommendation. The only constraint is to be an available product.

In [38]:
import random

In [39]:
indexes = pd.Series(df.index, index=df["link"])

In [40]:
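def recommendations(link, similarity):
    index = indexes[link]
    # Pair every product position with its score against the input product,
    # sorted from highest to lowest score.
    scores = list(enumerate(similarity[index]))
    scores = sorted(scores, key=lambda x: x[1], reverse=True)
    # Collect all products tied for the top score, excluding the input itself.
    max_score = max(scores, key=lambda x: x[1])
    max_indexes = [i[0] for i in scores if i[1] == max_score[1]]
    if index in max_indexes:
        max_indexes.remove(index)
    if len(max_indexes) > 3:
        # More than three ties: pick three of them at random.
        link_indexes = random.sample(max_indexes, 3)
    else:
        # Otherwise take the three best-scoring products other than the input.
        scores = [i[0] for i in scores]
        scores.remove(index)
        link_indexes = random.sample(scores[0:3], 3)
    return df["link"].iloc[link_indexes]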

def random_recommendation():
    indexes = df[df["price"] != "Agotado"].index
    link_index = random.sample(sorted(indexes), 1)
    return df["link"].iloc[link_index]

def get_recommendations(link, similarity):
    print(f"Input link:\n {link}")
    recos = recommendations(link, similarity)
    random_reco = random_recommendation()
    print(f"Output links:")
    reco_list = [rec for rec in recos]
    reco_list.append(random_reco.iloc[0])
    reco_list = random.sample(reco_list, 4)
    for reco in reco_list:
        print(" " + reco)
    return reco_list

For the formula, we add the matrices, each multiplied by a weight reflecting its impact (here $0.3$ each), plus the availability penalty.

In [41]:
formula = 0.3*cosim_categorical + 0.3*cosim_description + 0.3*weighted_matrix + available_matrix
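To see how the terms interact, the sketch below combines one row of made-up scores for five products and picks the three best. All numbers are hypothetical, and the selection is simplified: it ignores the random tie-breaking done in `recommendations`.

```python
# Hypothetical scores of product 0 against products 0..4.
categorical = [1.0, 0.8, 0.2, 0.9, 0.5]
description = [1.0, 0.1, 0.3, 0.7, 0.4]
rating = [0.5, 0.9, 0.6, 0.2, 1.0]
available = [0, 0, 0, -100, 0]  # product 3 is out of stock

# Same weighting as the formula above, applied to a single row.
row = [
    0.3 * c + 0.3 * d + 0.3 * r + a
    for c, d, r, a in zip(categorical, description, rating, available)
]

query = 0
# Rank every other product from highest to lowest combined score.
ranked = sorted(
    (j for j in range(len(row)) if j != query),
    key=lambda j: row[j],
    reverse=True,
)
top3 = ranked[:3]
print(top3)
```

Product 3 is very similar to the query on both similarity terms, but the $-100$ penalty pushes it to the bottom of the ranking.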

As an example, these are the recommendations for the product at index $25$:

In [42]:
_ = get_recommendations(df["link"][25], formula)

Input link:
 https://adagio.cl/products/arabica-mocha
Output links:
 https://adagio.cl/products/pack-4-pomos-flavour-explotion
 https://adagio.cl/products/deleite-de-curcuma
 https://adagio.cl/products/arabica-chai
 https://adagio.cl/products/berry-blues
