# **Airbnb Content-Based Recommendation System**

![](https://assets3.thrillist.com/v1/image/1451130/size/tmg-article_main_wide_2x.jpg)

## **Introduction**
In this notebook, I attempt to implement a content-based recommendation algorithm. Here we will use listings data from Airbnb around the Seattle area. The engine will learn from:
1. id: listings id for every room around Seattle
2. name: the title of room listings
3. description: details given by the host to describe their rooms

## **Objective:** 
* Learn from the data and recommend the best rooms around Seattle to users, based on content similarities (name and description)
* Provide more room options and increase personalization for prospective guests

In [1]:
import sys
sys.path.insert(0, "/home/apprenant/PycharmProjects/Foodflix_part_2")

In [2]:
# Importing the libraries
import pandas as pd
from IPython.display import Image, HTML
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [3]:
from wordcloud import WordCloud, STOPWORDS

ModuleNotFoundError: No module named 'wordcloud'

In [4]:
# Importing the dataset
df = pd.read_csv('../Data/intermediate.csv', nrows=20000)
df.head(10)

Unnamed: 0.1,Unnamed: 0,product_name,generic_name,brands,categories,countries,nutrition_grade_fr,energy_100g,energy-from-fat_100g,fat_100g,...,-maltose_100g,-maltodextrins_100g,fiber_100g,proteins_100g,salt_100g,sodium_100g,fruits-vegetables-nuts_100g,nutrition-score-fr_100g,nutrition-score-uk_100g,Missing_100g
0,185,Root Beer,,A&W,"Boissons,Boissons gazeuses,Sodas,Boissons sucr...",France,e,215.0,,0.0,...,,,0.0,0.0,0.0616,0.0242,,18.0,3.0,1
1,189,"Gotta-have grape, seriously strawberry flavor",Bonbons acidulés Raisin Fraise,Nerds,"Snacks sucrés,Confiseries,Bonbons",France,d,1667.0,,0.0,...,,,,0.0,0.0,0.0,,14.0,14.0,2
2,190,Thé noir aromatisé violette et fleurs,,Alice Délice,en:beverages,France,c,17.0,,0.1,...,,,,0.1,0.001,0.000394,,2.0,0.0,2
3,193,Preparation mug cake chocolat-caramel au beurr...,,Alice Délice,,France,e,1632.0,,7.0,...,,,0.0,7.0,0.975,0.383858,,21.0,21.0,1
4,194,Mini Confettis,,Alice Délice,,France,d,1753.0,,,...,,,0.9,0.6,0.01,0.003937,,14.0,14.0,3
5,195,Praliné Amande Et Noisette,,Alice Délice,,France,d,2406.0,,,...,,,3.9,9.5,0.003,0.001181,,14.0,14.0,3
6,231,"Pepsi, Nouveau goût !",Boisson gazeuse rafraîchissante aux extraits n...,Pepsi,Sodas au cola,France,e,177.0,,0.0,...,,,0.0,0.0,0.0254,0.01,,13.0,2.0,1
7,238,Blle Pet 50CL Coca Cola Cherry,,Coca-Cola,en:beverages,France,e,180.0,,0.1,...,,,,0.0,0.0,0.0,,14.0,2.0,2
8,240,Crêpes jambon fromage,,Bo Frost,,France,b,678.0,,6.6,...,,,0.9,8.2,0.73,0.287402,,0.0,0.0,1
9,242,Tarte Poireaux Et Lardons,,Bo Frost,,France,d,1079.0,,,...,,,1.4,7.5,0.8,0.314961,,15.0,15.0,3


In [5]:
cols=['product_name', 'generic_name', 'brands', 'categories']
df=df[cols]

In [6]:
df.head()

Unnamed: 0,product_name,generic_name,brands,categories
0,Root Beer,,A&W,"Boissons,Boissons gazeuses,Sodas,Boissons sucr..."
1,"Gotta-have grape, seriously strawberry flavor",Bonbons acidulés Raisin Fraise,Nerds,"Snacks sucrés,Confiseries,Bonbons"
2,Thé noir aromatisé violette et fleurs,,Alice Délice,en:beverages
3,Preparation mug cake chocolat-caramel au beurr...,,Alice Délice,
4,Mini Confettis,,Alice Délice,


In [7]:
df.isna().sum()

product_name        0
generic_name    12030
brands            171
categories       6982
dtype: int64

In [8]:
df.dtypes

product_name    object
generic_name    object
brands          object
categories      object
dtype: object

## Exploratory Data Analysis
Are there certain words that figure more often in listings' names and descriptions? I suspect some words occur more frequently and are considered more worthy of a title. Let us find out!

In [11]:
#df['product_name'] = df['product_name'].astype('str')
#df['description'] = df['description'].astype('str')

In [13]:
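name_cols = ['product_name', 'generic_name']
# Join the cell values (joining the DataFrame itself would only concatenate the column labels)
name_corpus = ' '.join(df[name_cols].fillna('').astype(str).values.ravel())

description_cols = ['brands', 'categories']
description_corpus = ' '.join(df[description_cols].fillna('').astype(str).values.ravel())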

In [None]:
name_wordcloud = WordCloud(stopwords = STOPWORDS, background_color = 'white', height = 2000, width = 4000).generate(name_corpus)
plt.figure(figsize = (16,8))
plt.imshow(name_wordcloud)
plt.axis('off')
plt.show()

In [None]:
description_wordcloud = WordCloud(stopwords = STOPWORDS, background_color = 'white', height = 2000, width = 4000).generate(description_corpus)
plt.figure(figsize = (16,8))
plt.imshow(description_wordcloud)
plt.axis('off')
plt.show()

The words that occur most frequently in the names or titles of listings include Seattle, Capitol Hill, View, Home, Cozy, etc. This clearly reflects Seattle-area data, with words common to room listings. In the descriptions, by contrast, the top words are more specific: house, home, apartment, living room, space. These are typical words hosts use when describing their listings.
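
Since the `wordcloud` import failed earlier in this notebook, a plain frequency count can give a similar view of the most common words. The following is only a sketch: `top_words` is a hypothetical helper, and the sample strings are product names copied from outputs elsewhere in this notebook.

```python
import re
from collections import Counter

def top_words(texts, n=10, min_len=3):
    """Return the n most frequent lowercase words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        # \w+ keeps accented letters such as "é" inside the word
        words = re.findall(r"\w+", str(text).lower())
        counts.update(w for w in words if len(w) >= min_len)
    return counts.most_common(n)

# Illustrative sample; in the notebook this could be df['product_name'].dropna()
sample = [
    "Caramel au beurre salé",
    "Caramel beurre salé",
    "Crème Caramel Beurre Salé",
]
print(top_words(sample, n=3))
```

The `min_len` threshold is a crude stand-in for a stop-word list: it drops short words such as "au" without needing any extra package.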

Now we create a column that combines the name and description columns; this combined text is the input to the content-based recommendation system.

In [9]:
df['content'] = df[['product_name', 'categories']].astype(str).apply(lambda x: ' // '.join(x), axis = 1)

In [10]:
df.head()

Unnamed: 0,product_name,generic_name,brands,categories,content
0,Root Beer,,A&W,"Boissons,Boissons gazeuses,Sodas,Boissons sucr...","Root Beer // Boissons,Boissons gazeuses,Sodas,..."
1,"Gotta-have grape, seriously strawberry flavor",Bonbons acidulés Raisin Fraise,Nerds,"Snacks sucrés,Confiseries,Bonbons","Gotta-have grape, seriously strawberry flavor ..."
2,Thé noir aromatisé violette et fleurs,,Alice Délice,en:beverages,Thé noir aromatisé violette et fleurs // en:be...
3,Preparation mug cake chocolat-caramel au beurr...,,Alice Délice,,Preparation mug cake chocolat-caramel au beurr...
4,Mini Confettis,,Alice Délice,,Mini Confettis // nan
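
The `// nan` suffix in the last row appears because `astype(str)` turns missing values into the literal string `'nan'`, which then becomes a token shared by every product without categories. One possible alternative, sketched here on a small hypothetical frame named `demo` (values copied from rows shown above), is to fill missing values with an empty string before joining:

```python
import pandas as pd

# Small stand-in for the product table (values copied from rows shown above)
demo = pd.DataFrame({
    "product_name": ["Pepsi, Nouveau goût !", "Mini Confettis"],
    "categories": ["Sodas au cola", None],
})

# Replace missing values with an empty string *before* casting to str,
# so that no literal "nan" token ends up in the combined text.
parts = demo[["product_name", "categories"]].fillna("").astype(str)
demo["content"] = parts["product_name"] + " // " + parts["categories"]

print(demo["content"].tolist())
```

The ` // ` separator is kept even when the description is empty, so code that later splits `content` on it still finds two parts.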


In [11]:
# Fill missing values (a no-op here: astype(str) above already produced the string 'nan')
df['content'] = df['content'].fillna('Null')

In [12]:
df.head()

Unnamed: 0,product_name,generic_name,brands,categories,content
0,Root Beer,,A&W,"Boissons,Boissons gazeuses,Sodas,Boissons sucr...","Root Beer // Boissons,Boissons gazeuses,Sodas,..."
1,"Gotta-have grape, seriously strawberry flavor",Bonbons acidulés Raisin Fraise,Nerds,"Snacks sucrés,Confiseries,Bonbons","Gotta-have grape, seriously strawberry flavor ..."
2,Thé noir aromatisé violette et fleurs,,Alice Délice,en:beverages,Thé noir aromatisé violette et fleurs // en:be...
3,Preparation mug cake chocolat-caramel au beurr...,,Alice Délice,,Preparation mug cake chocolat-caramel au beurr...
4,Mini Confettis,,Alice Délice,,Mini Confettis // nan


In [13]:
df.shape

(20000, 5)

In [14]:
df["id"] = df.index

## Train the Recommender

## TF-IDF (Term Frequency - Inverse Document Frequency)
Create a TF-IDF matrix of unigrams and bigrams for each item (identified by its id). The `stop_words` parameter tells the TF-IDF module to ignore common English words such as 'the' and 'about'. Note that the product texts shown above are mostly French, so an English stop-word list leaves words such as 'au' or 'de' in place. TF-IDF parses the content of each item and weights the distinct words and phrases it contains; those weights are then used to find items with similar content. The formula is:

$$w_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

where:
* $tf_{i,j}$ = number of occurrences of term $i$ in document $j$
* $df_i$ = number of documents containing term $i$
* $N$ = total number of documents
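
To make the formula concrete, here is a minimal hand-rolled sketch that applies it literally to three tiny documents. The function name `tfidf_weights` and the toy documents are assumptions for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed variant of the IDF term and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Compute w[i, j] = tf(i, j) * log(N / df(i)) for whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # doc_freq[i]: number of documents that contain term i at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # occurrences of each term in this document
        weights.append(
            {t: term_freq[t] * math.log(n_docs / doc_freq[t]) for t in term_freq}
        )
    return weights

docs = ["caramel beurre salé", "crème caramel", "chocolat noir"]
w = tfidf_weights(docs)
print(w[0])
```

In the first document, "caramel" appears in two of the three documents and gets weight log(3/2), while "beurre" appears in only one and gets the larger weight log(3): rarer terms count for more.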

In [15]:
# min_df = 1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 2), min_df = 1, stop_words = 'english')
tfidf_matrix = tf.fit_transform(df['content'])

## Cosine Similarity
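Cosine similarity measures how close two vectors are by the cosine of the angle between them. TF-IDF turns each item's content into a vector, and this measure identifies which contents are closest to each other. Because `TfidfVectorizer` L2-normalizes each row by default, the dot product computed by `linear_kernel` below equals the cosine similarity.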
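
A dependency-free sketch of that measure, using hypothetical helper names (`dot`, `cosine`, `l2_normalize`) and made-up vectors. It also checks that, for unit-length vectors, a plain dot product gives the same number as the cosine:

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(u):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]

u = [1.0, 2.0, 0.0]
v = [2.0, 1.0, 1.0]

# After L2 normalization the plain dot product already equals the cosine.
print(cosine(u, v), dot(l2_normalize(u), l2_normalize(v)))
```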


In [16]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

Iterate over the items and, for each one, store its 100 most similar items.

In [17]:
results = {}
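for idx, row in df.iterrows():
    # Take the 101 highest scores in descending order: the item itself plus 100 neighbours
    similar_indices = cosine_similarities[idx].argsort()[:-102:-1]
    similar_items = [(cosine_similarities[idx][i], df['id'][i]) for i in similar_indices]
    # Drop the first entry, which is expected to be the item itself (similarity 1.0)
    results[row['id']] = similar_items[1:]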
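
Dropping the first entry assumes the item itself is always ranked first. When several products have identical content (for example the "Chocolat // nan" rows that appear later in this notebook), ties at similarity 1.0 could place a duplicate first instead. A minimal sketch that excludes the item by index rather than by position is shown here; `top_k_similar` is a hypothetical helper and the similarity row is made up.

```python
import heapq

def top_k_similar(sim_row, self_index, k):
    """Return the k (score, index) pairs with the highest score, skipping the item itself."""
    candidates = ((score, i) for i, score in enumerate(sim_row) if i != self_index)
    # Rank on the score only
    return heapq.nlargest(k, candidates, key=lambda pair: pair[0])

# Toy similarity row for item 0: item 2 is an exact duplicate (score 1.0)
row = [1.0, 0.2, 1.0, 0.6]
print(top_k_similar(row, self_index=0, k=2))
```

Here the duplicate at index 2 is kept as a recommendation, while index 0 (the item itself) is never returned.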

## Let's make a prediction

Create two functions for showing the recommender results:
1. Function to get a friendly item name and description from the content field, given an item ID
2. Function to read the results out of the dictionary

In [18]:
def item(item_id):
    # Look the row up once, then split the content field back into name and description
    parts = df.loc[df['id'] == item_id, 'content'].tolist()[0].split(' // ')
    name = parts[0]
    desc = ' \nDescription: ' + parts[1][0:165] + '...'
    return name + desc

def recommend(item_id, num):
    print('Recommending ' + str(num) + ' products similar to ' + item(item_id))
    print('---')
    recs = results[item_id][:num]
    for rec in recs:
        print('\nRecommended: ' + item(rec[1]) + '\n(score:' + str(rec[0]) + ')')

Finally, pass an id from the data and the number of recommendations to show.

In [19]:
recommend(item_id = 3, num = 5)

Recommending 5 products similar to Preparation mug cake chocolat-caramel au beurre salé 
Description: nan...
---

Recommended: Caramel au beurre salé 
Description: nan...
(score:0.6087937221862616)

Recommended: Caramel beurre salé 
Description: nan...
(score:0.4403439816325832)

Recommended: Caramels au beurre salé 
Description: Chocolat,Caramel...
(score:0.4078820348478259)

Recommended: Crème Caramel Beurre Salé 
Description: nan...
(score:0.3868191192630289)

Recommended: Crème de Caramel au Beurre Salé au Sel de Guérande 
Description: Pâtes à tartiner au caramel...
(score:0.30855836329767344)


## Search engine for a single product

In [21]:
product = input()

Chocolat


In [22]:
print(product)

Chocolat


In [23]:
# min_df = 1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 2), min_df = 1, stop_words = 'english')
tfidf_matrix = tf.fit_transform(df['content'])

In [24]:
tfidf_product = tf.transform([product])

In [33]:
cosine_similarities = linear_kernel(tfidf_product, tfidf_matrix)

In [34]:
# argsort is ascending, so [:-5:-1] walks back from the end and yields the 4 best matches
similar_indices = cosine_similarities[0].argsort()[:-5:-1]
similar_items = [(cosine_similarities[0][i], df['id'][i]) for i in similar_indices]
results = similar_items

In [35]:
print(results)

[(0.49308786889530515, 11618), (0.49308786889530515, 11390), (0.49308786889530515, 12390), (0.49308786889530515, 17657)]


In [36]:
for i in range(4):
    print(df.iloc[results[i][1]])

product_name           Chocolat
generic_name                NaN
brands                    Lindt
categories                  NaN
content         Chocolat // nan
id                        11618
Name: 11618, dtype: object
product_name           Chocolat
generic_name                NaN
brands                    Milka
categories                  NaN
content         Chocolat // nan
id                        11390
Name: 11390, dtype: object
product_name           Chocolat
generic_name                NaN
brands              Cacao Barry
categories                  NaN
content         Chocolat // nan
id                        12390
Name: 12390, dtype: object
product_name           Chocolat
generic_name                NaN
brands                  Cacolac
categories                  NaN
content         Chocolat // nan
id                        17657
Name: 17657, dtype: object


## The results I had before

In [114]:
#print(results)

[(1.0000000000000002, 19999), (0.43607050038135664, 11905), (0.4090458323517676, 8464), (0.4090458323517676, 8490)]




In [117]:
# for i in range(4):
#     print(df.iloc[results[i][1]])

product_name                          Bat'o choc chocolat au lait
generic_name                                                  NaN
brands                                                     Casino
categories                            Pockys,Biscuits au chocolat
content         Bat'o choc chocolat au lait // Pockys,Biscuits...
id                                                          19999
Name: 19999, dtype: object
product_name                              Pépito chocolat au lait
generic_name                         Biscuits au chocolat au lait
brands                                                         Lu
categories      Biscuits au chocolat,Biscuits-au-chocolat-au-lait
content         Pépito chocolat au lait // Biscuits au chocola...
id                                                          11905
Name: 11905, dtype: object
product_name                             Granola Chocolat au Lait
generic_name          Biscuits sablés nappés de chocolat au lait 
brands                

## One way to do it

In [209]:
product = input()

Chocolat


In [27]:
print(product)

Chocolat


In [28]:
# min_df = 1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 2), min_df = 1, stop_words = 'english')
tfidf_matrix = tf.fit_transform(df['content'])

In [29]:
tfidf_product = tf.transform([product])

In [30]:
cosine_similarities = linear_kernel(tfidf_product, tfidf_matrix).flatten()

In [31]:
results2 = cosine_similarities.argsort()[:-5:-1]
print(results2)

[11618 11390 12390 17657]


In [32]:
for i in range(4):
    print(df.iloc[results2[i]])

product_name           Chocolat
generic_name                NaN
brands                    Lindt
categories                  NaN
content         Chocolat // nan
id                        11618
Name: 11618, dtype: object
product_name           Chocolat
generic_name                NaN
brands                    Milka
categories                  NaN
content         Chocolat // nan
id                        11390
Name: 11390, dtype: object
product_name           Chocolat
generic_name                NaN
brands              Cacao Barry
categories                  NaN
content         Chocolat // nan
id                        12390
Name: 12390, dtype: object
product_name           Chocolat
generic_name                NaN
brands                  Cacolac
categories                  NaN
content         Chocolat // nan
id                        17657
Name: 17657, dtype: object


In [45]:
choices = []
for i in range(4):
    choices.append(df.iloc[results2[i]])
print(choices)

[product_name           Chocolat
generic_name                NaN
brands                    Lindt
categories                  NaN
content         Chocolat // nan
id                        11618
Name: 11618, dtype: object, product_name           Chocolat
generic_name                NaN
brands                    Milka
categories                  NaN
content         Chocolat // nan
id                        11390
Name: 11390, dtype: object, product_name           Chocolat
generic_name                NaN
brands              Cacao Barry
categories                  NaN
content         Chocolat // nan
id                        12390
Name: 12390, dtype: object, product_name           Chocolat
generic_name                NaN
brands                  Cacolac
categories                  NaN
content         Chocolat // nan
id                        17657
Name: 17657, dtype: object]
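
The query steps above (fit, transform the query, score, sort) are repeated cell by cell for each product. They could be wrapped in one reusable function, as in this sketch. `build_search` and `search` are hypothetical names, the contents are a tiny illustrative sample, and the `stop_words` and `min_df` settings are left at their defaults, which is an assumption rather than the notebook's configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def build_search(contents):
    """Fit TF-IDF once and return a function that ranks contents against a query."""
    vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
    matrix = vectorizer.fit_transform(contents)

    def search(query, k=4):
        scores = linear_kernel(vectorizer.transform([query]), matrix).flatten()
        top = scores.argsort()[::-1][:k]  # positions of the k highest scores
        # Keep only rows that share at least one term with the query
        return [(int(i), float(scores[i])) for i in top if scores[i] > 0]

    return search

# Illustrative contents; in the notebook this would be df['content']
contents = [
    "Chocolat // nan",
    "Caramel au beurre salé // nan",
    "Pépito chocolat au lait // Biscuits au chocolat",
]
search = build_search(contents)
print(search("caramel", k=2))
```

Fitting once and reusing the fitted vectorizer avoids recomputing the TF-IDF matrix for every query, and filtering out zero scores avoids returning arbitrary rows when fewer than `k` items match.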


### Another product

In [217]:
product = input()

caramel


In [218]:
print(product)

caramel


In [219]:
# min_df = 1 (the default) keeps every term; recent scikit-learn versions reject the integer 0
tf = TfidfVectorizer(analyzer = 'word', ngram_range = (1, 2), min_df = 1, stop_words = 'english')
tfidf_matrix = tf.fit_transform(df['content'])

In [220]:
tfidf_product = tf.transform([product])

In [221]:
cosine_similarities = linear_kernel(tfidf_product, tfidf_matrix).flatten()

In [222]:
results2 = cosine_similarities.argsort()[:-5:-1]
print(results2)

[ 9316  9476  9486 17205]


In [223]:
for i in range(4):
    print(df.iloc[results2[i]])

product_name                     Le Petit Pot de Crème au Caramel
generic_name    Dessert lacté aux oeufs frais et caramel, cuit...
brands                                         Nestlé,La Laitière
categories                                 Crèmes dessert caramel
content         Le Petit Pot de Crème au Caramel // Crèmes des...
id                                                           9316
Name: 9316, dtype: object
product_name                           Velours de Crème (Caramel)
generic_name                             Crème dessert au caramel
brands                                         Nestlé,La Laitière
categories      Frais,Produits laitiers,Desserts,Desserts lact...
content         Velours de Crème (Caramel) // Frais,Produits l...
id                                                           9476
Name: 9476, dtype: object
product_name                           Velours de Crème (Caramel)
generic_name                             Crème dessert au caramel
brands                  

## Count Vectorizer

In [37]:
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
product = input()

Chocolat


In [38]:
print(product)

Chocolat


In [39]:
vectorizer = CountVectorizer()
vect_matrix = vectorizer.fit_transform(df['content'])

In [40]:
vect_product = vectorizer.transform([product])

In [41]:
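# Count vectors are not normalized, so linear_kernel returns raw dot products (term-overlap counts), not cosines
cosine_similarities = linear_kernel(vect_product, vect_matrix)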

In [42]:
# argsort is ascending, so [:-5:-1] walks back from the end and yields the 4 best matches
similar_indices = cosine_similarities[0].argsort()[:-5:-1]
similar_items = [(cosine_similarities[0][i], df['id'][i]) for i in similar_indices]
results = similar_items

In [43]:
print(results)

[(6.0, 359), (5.0, 17227), (4.0, 9497), (4.0, 9802)]
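
Scores such as 6.0 and 5.0 are raw dot products of count vectors, that is, the number of times the query word occurs in each product text, so longer texts that repeat the word rank higher. The sketch below, on two made-up texts, contrasts this with `cosine_similarity` from scikit-learn, which divides by the vector lengths:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up texts for illustration
contents = [
    "chocolat chocolat chocolat noir lait blanc",  # long text, word repeated
    "chocolat",                                    # exact match
]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(contents)
query = vectorizer.transform(["chocolat"])

dot_scores = linear_kernel(query, counts).flatten()      # raw overlap counts
cos_scores = cosine_similarity(query, counts).flatten()  # length-normalized

print(dot_scores, cos_scores)
```

With the dot product the long text wins (3 against 1); with cosine similarity the exact match scores 1.0 and ranks first.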


In [44]:
for i in range(4):
    print(df.iloc[results[i][1]])

product_name    Sundae Triple Chocolat avec du chocolat belge,...
generic_name                                                  NaN
brands                                            Marks & Spencer
categories      Snacks sucrés,Produits laitiers,Desserts,Surge...
content         Sundae Triple Chocolat avec du chocolat belge,...
id                                                            359
Name: 359, dtype: object
product_name                               Crunchy Nappé Chocolat
generic_name                                                  NaN
brands                                                  St Michel
categories      Snacks sucrés,Biscuits et gâteaux,Biscuits,Bis...
content         Crunchy Nappé Chocolat // Snacks sucrés,Biscui...
id                                                          17227
Name: 17227, dtype: object
product_name          Le Petit Pot de Crème au Chocolat (8 Pots) 
generic_name    Dessert lacté aux œufs et au chocolat, cuit au...
brands                  