# Adagio Recommendation System

## Introduction

One way to keep consumers on a page longer is to show them products that may interest them. To do so, one can build a **recommendation system**.

Different models exist. **A simple one is to recommend other products from the same category**. For example, if someone is looking for a TV on an electronics website, the recommendations can be other TVs (not refrigerators, coffee makers, and so on).

A more sophisticated approach is to **recommend the most similar products**. For example, if the same person is looking for a low-consumption TV on the same website, the recommendations can be other low-consumption electronics (not high-consumption TVs).

The latter raises a problem: **we first need to define a metric to measure similarity**. That is, we assign a number to each pair of products, where a larger number means greater similarity.

In this notebook, we obtain similarities from data collected from the [Adagio website](https://adagio.cl/) and build an algorithm capable of recommending tea-related products.

---

Import the basic necessary libraries.

In [1]:
import numpy as np
np.random.seed(151515)
import pandas as pd

### Import Dataset

Import the dataset.

In [2]:
df = pd.read_csv("data_scraper/products.csv")

The data looks like:

In [3]:
df.head(5)

Unnamed: 0,amount,benefit,description_1,description_2,format,link,name,price,property_1,property_2,property_3,property_4,property_5,raters,rating,temperature,time
0,"2,5g/250ml",negro,"te negro confeti verde, ojos de caramelo y sab...","te negro confeti verde, ojos de caramelo y sab...",57 gramos,https://adagio.cl/products/monster-mash,monster mash,5990,energizante,,,,,1.0,3.0,100 celsius,3 min
1,"2,5g/250ml",rooibos,"rooibos verde, full antioxidante, con toques d...","rooibos verde, manzanilla, cascara de naranja,...",85 gramos,https://adagio.cl/products/rooibos-tropical,rooibos verde tropical,5990,relajante,sin-cafeina,,,,1.0,5.0,100 celsius,5 min
2,2g por taza,hierba,la primavera ya llego y con ella un clima mas ...,"hojas de mora, menta, hibisco, flores de lavan...",57 gramos,https://adagio.cl/products/feel-the-spring,feel the spring,5990,relajante,sin-cafeina,,,,,,100 celsius,5-10 min
3,"2,5g/250ml",oolong,nuestro blackberry sage oolong es una mezcla e...,"te oolong, salvia, hojas de frambuesa, sabor n...",57 gramos,https://adagio.cl/products/oolong-mora,oolong salvia mora,5990,digestion,,,,,1.0,2.0,100 celsius,3-5 min
4,,,prepara tu te favorto de forma facil con la cu...,echar en un taza la cantidad de cucharadas ind...,fucsia,https://adagio.cl/products/cuchara-gold-fucsia,cuchara gold fucsia,3490,,,,,,,,,


The dataset has the following structure:

In [4]:
df.info()
print(f"\nDataFrame Shape: {df.shape}")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 643 entries, 0 to 642
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   amount         385 non-null    object 
 1   benefit        383 non-null    object 
 2   description_1  637 non-null    object 
 3   description_2  639 non-null    object 
 4   format         556 non-null    object 
 5   link           643 non-null    object 
 6   name           643 non-null    object 
 7   price          643 non-null    object 
 8   property_1     407 non-null    object 
 9   property_2     255 non-null    object 
 10  property_3     56 non-null     object 
 11  property_4     5 non-null      object 
 12  property_5     0 non-null      float64
 13  raters         454 non-null    float64
 14  rating         454 non-null    float64
 15  temperature    385 non-null    object 
 16  time           385 non-null    object 
dtypes: float64(3), object(14)
memory usage: 85.5+ KB

Drop the `property_5` column, as it has no non-null values.

In [5]:
df.drop(columns="property_5", inplace=True)

### Remove Duplicates

Duplicates can affect the model: ideally, the same product should not be recommended twice.

In [6]:
duplicates_number = sum(df["link"].duplicated())
print(f"Number of duplicated links: {duplicates_number}")

Number of duplicated links: 379


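Drop the duplicated rows, then shuffle the remaining rows and reset the index. Since $643 - 379 = 264$ rows remain, the fully duplicated rows are exactly the rows with duplicated links.

In [7]:
# Remove fully duplicated rows, keeping the first occurrence of each.
df.drop_duplicates(inplace=True)
# Shuffle the rows reproducibly and rebuild a clean 0..n-1 index.
df = df.sample(frac=1, random_state=5454).reset_index(drop=True)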
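
Counting duplicated links and dropping fully duplicated rows are not the same operation in general. The following is a minimal sketch on made-up data (the `toy` frame and its values are assumptions, not part of the dataset) showing how the two can differ:

```python
import pandas as pd

# Hypothetical toy data: the first two rows are identical, the third
# shares their link but has a different price.
toy = pd.DataFrame({
    "link": ["a", "a", "a", "b"],
    "price": ["10", "10", "12", "7"],
})

# Rows whose link already appeared in an earlier row.
n_dup_links = int(toy["link"].duplicated().sum())

# Dropping fully identical rows keeps the row with the different price.
by_row = toy.drop_duplicates()

# Dropping on the link alone keeps a single row per link.
by_link = toy.drop_duplicates(subset="link")

print(n_dup_links, len(by_row), len(by_link))
```

If the two counts ever disagree on real data, passing `subset="link"` is one way to guarantee a single row per product page.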

In [8]:
print(f"Final DataFrame shape: {df.shape}")

Final DataFrame shape: (264, 16)


After removing duplicates, the dataset has the following structure:

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 264 entries, 0 to 263
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   amount         154 non-null    object 
 1   benefit        153 non-null    object 
 2   description_1  262 non-null    object 
 3   description_2  261 non-null    object 
 4   format         219 non-null    object 
 5   link           264 non-null    object 
 6   name           264 non-null    object 
 7   price          264 non-null    object 
 8   property_1     162 non-null    object 
 9   property_2     91 non-null     object 
 10  property_3     21 non-null     object 
 11  property_4     2 non-null      object 
 12  raters         182 non-null    float64
 13  rating         182 non-null    float64
 14  temperature    154 non-null    object 
 15  time           154 non-null    object 
dtypes: float64(2), object(14)
memory usage: 33.1+ KB


## Similarities

### Categorical Variables

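Some products have labels such as *te negro*, *energizante* and so on. The algorithm should recommend products with similar labels.

The label columns take the following values (shown after the cleaning steps below, so missing values appear as empty strings and digits in `format` as spaces):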

In [13]:
print(f"Unique Values in format column:\n {df.format.unique()}\n")
print(f"Unique Values in benefit column:\n {df.benefit.unique()}\n")
print(f"Unique Values in property_1 column:\n {df.property_1.unique()}\n")
print(f"Unique Values in property_2 column:\n {df.property_2.unique()}\n")
print(f"Unique Values in property_3 column:\n {df.property_3.unique()}\n")
print(f"Unique Values in property_4 column:\n {df.property_4.unique()}\n")

Unique Values in format column:
 ['' '   gramos' '   teabags' 'dorada' 'hojas naranjo' 'verde'
 'turquesa infusor dorado' 'full color lila' 'menta' 'iridiscent'
 'pack good moments' 'azul' 'lila' 'estrella turquesa' 'grafito'
 'amarillo' 'negro' 'blanco' 'negra' 'full color menta'
 'moonstar turquesa' '   piramides' 'iridescent lila' 'turquesa'
 'dark blue' 'roja' '   gramos      bolitas' 'celeste' 'estrella salmon'
 'mug bhoro ceramica negro' 'rojo' 'black' 'estrella fucsia'
 'transparente' 'lila granito' 'grande' 'fucsia' 'infusor grafito'
 'nubes turquesa' 'turquesa granito']

Unique Values in benefit column:
 ['' 'matcha' 'negro' 'hierba' 'verde' 'blanco' 'oolong' 'rooibos' 'rojo'
 'amarillo']

Unique Values in property_1 column:
 ['' 'antioxidante' 'energizante' 'relajante' 'sin-cafeina' 'digestion']

Unique Values in property_2 column:
 ['' 'digestion' 'energizante' 'sin-cafeina' 'relajante']

Unique Values in property_3 column:
 ['' 'energizante' 'relajante' 'sin-cafeina']


Fill the missing values with empty strings.

In [11]:
# Assign the result back instead of calling fillna(inplace=True) on a single
# column, which may not modify df in recent pandas versions.
label_columns = ["format", "benefit", "property_1", "property_2", "property_3", "property_4"]
df[label_columns] = df[label_columns].fillna("")

Replace digits and special characters with spaces, and merge the property columns into one.

In [12]:
def clean_text(string):
    # Keep lowercase letters and spaces; replace any other character with a space.
    allowed = "abcdefghijklmnopqrstuvwxyz "
    return "".join(character if character in allowed else " " for character in string)

def clean_cafeina(string):
    # Remove hyphens so that "sin-cafeina" becomes the single word "sincafeina".
    return string.replace("-", "")

df["format"] = df["format"].apply(clean_text)
# Join the four property columns into a single space-separated string.
df["properties"] = df["property_1"] + " " + df["property_2"] + " " \
    + df["property_3"] + " " + df["property_4"]
df["properties"] = df["properties"].apply(lambda x: x.strip())
df["properties"] = df["properties"].apply(clean_cafeina)

Combine all the labels into a single `all_properties` column.

In [14]:
df["all_properties"] = df["format"] + " " + df["benefit"] + " " + df["properties"]
df["all_properties"] = df["all_properties"].apply(lambda x: x.strip())
df["all_properties"] = df["all_properties"].apply(lambda x: " ".join(x.split()))
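
To make the cleaning steps concrete, here is a small sketch on two made-up products. The `toy` frame, the value `15 teabags` and the helper name `keep_lowercase_letters` are assumptions for illustration; only two property columns are used to keep it short.

```python
import pandas as pd

def keep_lowercase_letters(text):
    # Replace anything that is not a lowercase ASCII letter or a space.
    allowed = "abcdefghijklmnopqrstuvwxyz "
    return "".join(ch if ch in allowed else " " for ch in text)

# Hypothetical products with already-filled missing values.
toy = pd.DataFrame({
    "format": ["57 gramos", "15 teabags"],
    "benefit": ["negro", ""],
    "property_1": ["energizante", "relajante"],
    "property_2": ["", "sin-cafeina"],
})

fmt = toy["format"].apply(keep_lowercase_letters)
props = (toy["property_1"] + " " + toy["property_2"]).str.replace("-", "", regex=False)

# Concatenate everything and collapse repeated whitespace.
combined = (fmt + " " + toy["benefit"] + " " + props).apply(lambda s: " ".join(s.split()))
print(combined.tolist())
```

The digits in `format` disappear, and the hyphen is removed so that a word-based vectorizer sees `sincafeina` as one token rather than two.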

#### Cosine Similarity

To measure how much one product resembles another, **we need to represent each item as a vector in some space**. This process is known as *vectorization*.
The column `all_properties` consists of sets of words. One can count how many times each word appears in each row. This can be done using the `CountVectorizer` class from scikit-learn.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
count = CountVectorizer()
count_matrix = count.fit_transform(df["all_properties"])

We get a `count_matrix` of shape:

In [19]:
count_matrix.shape

(264, 52)

Each row of the `all_properties` column is represented as a row of the $264\times 52$ matrix, and the $52$ columns correspond to the $52$ unique words found in the labels.

For example, the second row of `all_properties` reads:

In [22]:
print(df["all_properties"][1])

gramos matcha antioxidante digestion energizante


And the second row of `count_matrix` is:

In [23]:
count_matrix.toarray()[1]

array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0])

The $1$s correspond to the words that appear in the second row, and the $0$s to the words that do not.

In total, we have $264$ vectors in a $52$-dimensional vector space.

---
Now that we have the vectors, we need a way to measure how close they are. One option comes from the *[dot product](https://en.wikipedia.org/wiki/Dot_product#Geometric_definition)* of two vectors $\vec{v}$ and $\vec{w}$:

$$\vec{v}\cdot\vec{w}=\|\vec{v}\|\|\vec{w}\|\cos \theta,$$

where $\theta$ is the angle between $\vec{v}$ and $\vec{w}$, and $\|\cdot\|$ is the Euclidean norm.

Solving for the cosine of the angle gives
$$
\cos\theta = \frac{\vec{v}\cdot\vec{w}}{\|\vec{v}\|\|\vec{w}\|},
$$
which we use to measure the similarity of two vectors, since **a higher cosine means a smaller angle, and hence closer vectors**.

For example, $\cos\theta_{1} = \frac{\sqrt{2}}{2}$ is higher than $\cos\theta_{2} = -\frac{\sqrt{2}}{2}$, which implies that $\vec{v}$ is closer to $\vec{w}_{1}$ than to $\vec{w}_{2}$.

Note that our vectors live in a $52$-dimensional space and have non-negative entries, so the cosine between any two of them lies between $0$ and $1$.


<img src="images/angle1.jpeg" width=300/> <img src="images/angle2.jpeg" width=445/>
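
The two cosines of the example can be reproduced numerically. This is a small sketch with hand-picked 2D vectors (the vectors and the helper `cosine` are assumptions chosen to give angles of 45 and 135 degrees), compared against scikit-learn's `cosine_similarity`:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

v = np.array([1.0, 0.0])
w1 = np.array([1.0, 1.0])    # 45 degrees away from v
w2 = np.array([-1.0, 1.0])   # 135 degrees away from v

def cosine(a, b):
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cos1 = cosine(v, w1)
cos2 = cosine(v, w2)

# The same values, computed for all pairs at once by scikit-learn.
sk = cosine_similarity(np.vstack([v]), np.vstack([w1, w2]))
print(cos1, cos2, sk)
```

The higher cosine belongs to `w1`, the vector that points in a direction closer to `v`.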

In [27]:
from sklearn.metrics.pairwise import cosine_similarity

Create the cosine similarity matrix.

In [25]:
cosim_categorical = cosine_similarity(count_matrix, count_matrix)

The cosine similarity matrix has the following shape:

In [26]:
cosim_categorical.shape

(264, 264)

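### Description

The product descriptions are free text. We clean them, weight their words with TF-IDF, and compare the resulting vectors.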

In [20]:
sum(df["description_1"].isna())

2

In [21]:
sum(df["description_2"].isna())

3

In [22]:
# Replace missing descriptions with empty strings.
df["description_1"] = df["description_1"].fillna("")
df["description_2"] = df["description_2"].fillna("")

In [23]:
def clean_text(string):
    for character in string:
        if character not in "abcdefghijklmnopqrstuvwxyz ":
            string = string.replace(str(character), " ")
    return string

In [24]:
# Clean both descriptions and join them into a single text per product.
df["description_cleaned"] = df["description_1"].apply(clean_text) + \
    " " + df["description_2"].apply(clean_text)

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from utils import stop_words

In [26]:
tfidf = TfidfVectorizer(stop_words=stop_words)
tfidf_matrix = tfidf.fit_transform(df["description_cleaned"])
tfidf_matrix.shape

(264, 2107)

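In [27]:
from sklearn.metrics.pairwise import linear_kernel
# TfidfVectorizer L2-normalizes each row by default, so the plain dot product
# computed by linear_kernel equals the cosine similarity.
cosim_description = linear_kernel(tfidf_matrix, tfidf_matrix)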
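
The use of `linear_kernel` here in place of `cosine_similarity` can be checked on a small example. The sketch below assumes the default settings of `TfidfVectorizer` (in particular `norm="l2"`) and uses three made-up descriptions:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical short descriptions.
toy_descriptions = [
    "te negro con canela",
    "te verde con menta",
    "infusor de acero",
]
toy_tfidf = TfidfVectorizer().fit_transform(toy_descriptions)

# With the default settings every row has unit Euclidean norm.
row_norms = np.linalg.norm(toy_tfidf.toarray(), axis=1)

# So the dot product and the cosine similarity coincide.
same = np.allclose(
    linear_kernel(toy_tfidf, toy_tfidf),
    cosine_similarity(toy_tfidf, toy_tfidf),
)
print(row_norms, same)
```

Skipping the normalization step is what makes `linear_kernel` the cheaper of the two calls.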

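### Rating

Products without ratings are given a rating and a number of raters of $0$. Each rating is then pulled toward the mean rating, more strongly when it comes from few raters.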

In [28]:
# Treat products without ratings as having a rating of 0 from 0 raters.
df["rating"] = df["rating"].fillna(0.)
df["raters"] = df["raters"].fillna(0.)

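In [29]:
def weighted_rating(rating, num_votes, mean_rate, min_vote=1):
    # Weighted average of the product's own rating and the mean rating:
    # the more votes a product has, the more its own rating counts.
    return (rating*num_votes/(num_votes+min_vote)) + (mean_rate*min_vote/(num_votes+min_vote))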
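
A few hand-computed cases show how this formula behaves. The sketch restates the same function with the two weights named explicitly; the mean rating of `3.0` and the vote counts are made-up values:

```python
def weighted_rating(rating, num_votes, mean_rate, min_vote=1):
    # Share of the result that comes from the product's own rating.
    votes_weight = num_votes / (num_votes + min_vote)
    # Share of the result that comes from the overall mean rating.
    prior_weight = min_vote / (num_votes + min_vote)
    return rating * votes_weight + mean_rate * prior_weight

mean_rate = 3.0  # hypothetical mean rating

one_vote = weighted_rating(5.0, 1, mean_rate)     # halfway between 5 and 3
many_votes = weighted_rating(5.0, 99, mean_rate)  # almost the raw rating
no_votes = weighted_rating(0.0, 0, mean_rate)     # falls back to the mean

print(one_vote, many_votes, no_votes)
```

A single five-star vote yields $4.0$, ninety-nine of them yield $4.98$, and a product with no votes receives the mean rating.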

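In [30]:
# The mean includes unrated products, which were filled with 0 above.
mean_rate = np.mean(df["rating"])
weighted_rates = [weighted_rating(df["rating"][i], df["raters"][i], mean_rate) for i in df.index]

In [31]:
# Diagonal matrix holding the weighted rating of each product.
weighted_matrix = np.diag(weighted_rates)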

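### Price Variable

The `price` column reads "Agotado" (sold out) for unavailable products, which should not be recommended.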

In [32]:
# 1 if the product is available, 0 if its price reads "Agotado" (sold out).
available = [0 if row == "Agotado" else 1 for row in df["price"]]
available_matrix = np.diag(available)

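## Recommendations

The final score combines the two similarity matrices with the weighted ratings and the availability. The recommendations for a link are its best-scoring products.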

In [33]:
import random

In [34]:
# Map each product link to its row index.
indexes = pd.Series(df.index, index=df["link"])

In [35]:
def recommendations(link, similarity):
    """Return the links of three products similar to the product at `link`."""
    index = indexes[link]
    # Pair each product index with its score, sorted from most to least similar.
    scores = sorted(enumerate(similarity[index]), key=lambda x: x[1], reverse=True)
    max_score = scores[0][1]
    # Products tied at the highest score, excluding the input product itself.
    max_indexes = [i for i, score in scores if score == max_score and i != index]
    if len(max_indexes) > 3:
        # More than three ties: pick three of them at random.
        link_indexes = random.sample(max_indexes, 3)
    else:
        # Otherwise take the three best-scoring products, in random order.
        ranked = [i for i, _ in scores if i != index]
        link_indexes = random.sample(ranked[:3], 3)
    return df["link"].iloc[link_indexes]

def random_recommendation():
    """Return the link of one random product that is not sold out ("Agotado")."""
    available_indexes = df[df["price"] != "Agotado"].index
    link_index = random.sample(sorted(available_indexes), 1)
    return df["link"].iloc[link_index]

def get_recommendations(link, similarity):
    """Print three similar products plus one random available product."""
    print(f"Input link:\n {link}")
    recos = recommendations(link, similarity)
    random_reco = random_recommendation()
    print("Output links:")
    for reco in recos:
        print(" " + reco)
    print(f"random: {random_reco.iloc[0]}")
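
For comparison, a deterministic variant of the top-three selection can be written with `np.argsort`. This is only a sketch of an alternative, not the method used in this notebook: the name `top_k_indices` is hypothetical, and ties are broken by position instead of at random.

```python
import numpy as np

def top_k_indices(scores, own_index, k=3):
    # Sort indices from highest to lowest score; a stable sort keeps
    # tied products in their original order.
    order = np.argsort(-np.asarray(scores), kind="stable")
    # Never recommend the product itself.
    order = order[order != own_index]
    return order[:k].tolist()

# Hypothetical similarity row for product 0.
toy_scores = [0.9, 0.1, 0.7, 0.7, 0.3]
top_three = top_k_indices(toy_scores, own_index=0)
print(top_three)
```

The random sampling in `recommendations` has the advantage of varying the output when many products share the same score.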

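In [36]:
# Add the two similarity matrices, then right-multiply by the diagonal matrices:
# column j is scaled by the weighted rating of product j and set to zero
# when product j is sold out.
formula = (0.3*cosim_categorical + 0.3*cosim_description) @ (0.3*weighted_matrix) @ available_matrix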
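
The effect of multiplying by the diagonal matrices can be seen on a small example. The sketch below uses a made-up $3\times 3$ similarity matrix, made-up ratings and availability flags, and leaves out the constant $0.3$ factors:

```python
import numpy as np

# Hypothetical similarity matrix for three products.
similarity = np.array([
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
])
ratings = np.array([4.0, 2.0, 5.0])  # hypothetical weighted ratings
available = np.array([1, 0, 1])      # product 1 is sold out

# Right-multiplying by a diagonal matrix scales each column.
via_matmul = similarity @ np.diag(ratings) @ np.diag(available)

# The same result with elementwise broadcasting over the columns.
via_broadcast = similarity * ratings * available

print(via_matmul)
```

The column of the sold-out product becomes zero, so it can never be the best-scoring candidate for any other product, while the remaining columns are boosted by their ratings.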

In [37]:
get_recommendations(df["link"][154], formula)

Input link:
 https://adagio.cl/products/berry-blues
Output links:
 https://adagio.cl/products/deleite-de-curcuma
 https://adagio.cl/products/explosion-de-berries
 https://adagio.cl/products/berry-cream
random: https://adagio.cl/products/infusor-bhoro-rojo
