# Creating a Recommender System for Airbnb to Enhance User Experience, Retention and Boost Business

GROUP 2
*  > Mercy Onduso
*  > Nurulain Abdi
*  > Amos Kibet
*  > Beth Mithamo

# Overview

Airbnb is a global online marketplace that offers housing and other accommodations to travelers. The platform has grown significantly in popularity over the years, with millions of hosts and guests using the platform for their travel needs.


This project aims to build a recommender system that helps stakeholders and their clients make better-informed decisions. The system should also help stakeholders schedule renovations of their listings efficiently, without inconveniencing their clients.


# Business Problem 

Airbnb has become a popular alternative to traditional hotels for tourists and visitors in Cape Town, South Africa. Despite its advantages, users often face challenges on the platform, including poor recommendations, unreliable pricing, and a subpar customer experience. Stakeholders also often struggle to renovate their listings to meet the needs of their target customers.
A South Africa-based housing company wants to venture into the Airbnb business and needs a sustainable, profitable business model that can compete with established players in the market. As a new entrant on the Airbnb platform, the company's stakeholders aim to ensure customer retention and satisfaction and to grow their business. As data scientists, we are expected to answer their questions and provide recommendations.

Some of the questions we are expected to answer are:
1. What is the best month to visit Cape Town if you are on a budget?
2. What is the best time to list your property on Airbnb, and how should prices be set according to the time of year?
3. What is the best time of year for owners to take down their listings for maintenance and repair?
4. When is the best time to attract clients with offers ahead of an upcoming low season? (Time series analysis)
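
The seasonality questions above could be approached by aggregating calendar data by month. The following is a minimal sketch, not the project's actual analysis: it assumes the calendar prices have already been converted to numbers and the dates parsed, and every value in the `calendar` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical, already-cleaned calendar rows (all values are made up)
calendar = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-15", "2023-01-20",
        "2023-06-15", "2023-06-20",
        "2023-12-15", "2023-12-20",
    ]),
    "price": [1250.0, 800.0, 900.0, 600.0, 1500.0, 1000.0],
    "available": [False, True, True, True, False, False],
})

# Average nightly price and share of available nights per calendar month
monthly = (
    calendar.groupby(calendar["date"].dt.month)
    .agg(mean_price=("price", "mean"), share_available=("available", "mean"))
    .rename_axis("month")
)

# Month with the lowest average price (a candidate answer to question 1)
cheapest_month = int(monthly["mean_price"].idxmin())
# Month with the most unbooked nights (a candidate low season)
most_vacant_month = int(monthly["share_available"].idxmax())

print(monthly)
```

Under these assumptions, a month that combines a low average price with a high share of available nights would be a natural candidate for maintenance or promotional offers.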


# Data

We obtained the data from Inside Airbnb, which publishes data collected from the Airbnb platform. The dataset is available at http://insideairbnb.com/get-the-data/

The data provides insights into the availability, pricing, and characteristics of short-term rental properties such as apartments, houses, and rooms. It can be used to understand supply and demand dynamics in the market, as well as the preferences of guests and hosts. It can also help identify trends and patterns in guest behavior, such as popular locations, amenities, and property types. Additionally, the data will be used to develop a recommender system that makes personalized recommendations to guests based on their preferences and past behavior. This will help both stakeholders and their clients make better decisions.


# Airbnb Exploration

* Who needs the recommender system? The host: a South Africa-based housing company.
* Which technologies does the Airbnb platform currently use to provide recommendations? Cookies, mobile identifiers, tracking URLs, and log data.
* Airbnb uses two machine learning models to predict prices for clients: Smart Pricing and Price Tips.



# Possible Algorithms 

1. Collaborative Filtering: This algorithm is based on the idea that people who have similar preferences in the past are likely to have similar preferences in the future. Collaborative filtering can be further divided into two types: user-based and item-based. In user-based collaborative filtering, recommendations are made based on the preferences of similar users. In item-based collaborative filtering, recommendations are made based on the similarity between items.

2. Content-Based Filtering: This algorithm is based on the idea that recommendations can be made based on the characteristics of the items being recommended. For example, if a user has shown a preference for properties with a specific location or amenities, a content-based filtering algorithm can recommend similar properties based on these characteristics.

3. Matrix Factorization: This algorithm is based on the idea that the preferences of users and items can be represented in a lower-dimensional space. Matrix factorization algorithms try to find latent factors that explain the observed preferences of users and items, and use these factors to make recommendations.

4. Hybrid Algorithms: Hybrid algorithms combine two or more recommendation techniques to make more accurate and diverse recommendations. For example, a hybrid algorithm could combine collaborative filtering and content-based filtering to provide a more personalized and diverse set of recommendations.
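
To make the first, second, and fourth ideas concrete, here is a minimal NumPy sketch of item-based collaborative filtering, content-based filtering, and a weighted hybrid of the two. It is an illustration under stated assumptions rather than the project's implementation: the `interactions` and `features` arrays, the feature names, and the helpers `cosine_similarity_matrix`, `hybrid_scores`, and `recommend` are all hypothetical. Matrix factorization is not shown.

```python
import numpy as np

def cosine_similarity_matrix(m: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of m."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = m / norms
    return unit @ unit.T

# Hypothetical data: rows = guests, columns = listings, 1 = guest stayed there
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

# Hypothetical listing features: [entire_home, has_pool, near_beach]
features = np.array([
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [1, 1, 1],
], dtype=float)

# Item-based collaborative filtering: listings are similar if the same guests used them
item_cf_sim = cosine_similarity_matrix(interactions.T)
# Content-based filtering: listings are similar if their features match
content_sim = cosine_similarity_matrix(features)

def hybrid_scores(user_idx: int, alpha: float = 0.5) -> np.ndarray:
    """Blend both similarities; alpha=1 is pure collaborative, alpha=0 pure content."""
    sim = alpha * item_cf_sim + (1 - alpha) * content_sim
    seen = interactions[user_idx] > 0
    scores = sim[:, seen].sum(axis=1)  # similarity to everything the guest has seen
    scores[seen] = -np.inf             # never recommend a listing already seen
    return scores

def recommend(user_idx: int, k: int = 1, alpha: float = 0.5) -> list:
    """Indices of the k highest-scoring unseen listings."""
    scores = hybrid_scores(user_idx, alpha)
    return [int(i) for i in np.argsort(-scores)[:k]]

print(recommend(0, alpha=1.0))  # collaborative signal only
print(recommend(0, alpha=0.0))  # content signal only
```

The `alpha` weight is one simple way to combine the two signals; in practice it would need to be tuned against held-out data.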



# Data Understanding

The data contains distinct features that will help in the analysis and prediction. These are:
1. 'id': The unique identifier for the listing
2. 'name': The name of the listing (hotel/apartment)
3. 'host_id': The unique identifier for the host
4. 'host_name': The name of the host
5. 'neighbourhood_group': The area where the apartment/hotel is geographically located
6. 'neighbourhood': The name of the neighbourhood
7. 'latitude' and 'longitude': The exact geographical location
8. 'room_type': The type of room, for example a single private room or an entire apartment
9. 'price': The nightly price of the listing
10. 'minimum_nights': The minimum number of nights a client must book
11. 'number_of_reviews': The number of reviews from clients
12. 'last_review': The date of the most recent review
13. 'reviews_per_month': The average number of reviews per month
14. 'calculated_host_listings_count': The number of listings the host has
15. 'availability_365': The number of days the listing is available over the next 365 days
16. 'number_of_reviews_ltm': The number of reviews in the last twelve months
17. 'license': The licence or permit number, where provided


In [2]:
#import libraries
import pandas as pd
import numpy as np

In [21]:
#import listing dataset
url = "https://raw.githubusercontent.com/MercyMoraa/InsideAirbnb/main/listing%20final3.csv"
listing_df = pd.read_csv(url, encoding='ISO-8859-1')

print("Dataset shape:", listing_df.shape)
listing_df.head()

Dataset shape: (9481, 75)



Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,1625734,https://www.airbnb.com/rooms/1625734,20221200000000.0,12/29/2022,city scrape,"Fabulous Villa, Large Entertainment Area Patio...",You will love this fabulous Villa! Expansive l...,Located next to the Constantia Wine Lands home...,https://a0.muscache.com/pictures/ca9e8f0f-d55a...,8643899,...,5.0,5.0,5.0,,f,4,3,1,0,0.1
1,1626659,https://www.airbnb.com/rooms/1626659,20221200000000.0,12/29/2022,city scrape,Big Bay La Paloma - 2 bedroom suite,Enjoy the beautiful Bloubergstrand ~ Cape Town...,Big Bay Homestay is situated in the the tranqu...,https://a0.muscache.com/pictures/31710939/c96e...,5646468,...,5.0,4.83,4.92,,f,2,0,2,0,0.11
2,736534,https://www.airbnb.com/rooms/736534,20221200000000.0,12/29/2022,city scrape,Enjoy a Private Room in Chartfield Guesthouse,Awaken in a stylish private room to fresh coff...,"A gem on the False Bay coastline, Kalk Bay is ...",https://a0.muscache.com/pictures/monet/Select-...,3007248,...,4.75,5.0,4.83,,t,18,13,5,0,0.82
3,742345,https://www.airbnb.com/rooms/742345,20221200000000.0,12/30/2022,city scrape,Room with a View - Green Point,Fully furnished studio apartment with a fully ...,,https://a0.muscache.com/pictures/10201360/7ea2...,3886732,...,4.0,4.33,3.83,,t,80,79,1,0,0.29
4,3191,https://www.airbnb.com/rooms/3191,20221200000000.0,12/29/2022,city scrape,Malleson Garden Cottage,"This is a lovely, separate, self-catering cott...","Mowbray is on the Southern Suburbs line, 6km (...",https://a0.muscache.com/pictures/697022/385407...,3754,...,4.96,4.75,4.79,,t,1,1,0,0,0.6


In [22]:
#Investigate listing dataset
listing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9481 entries, 0 to 9480
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            9481 non-null   int64  
 1   listing_url                                   9481 non-null   object 
 2   scrape_id                                     9481 non-null   float64
 3   last_scraped                                  9481 non-null   object 
 4   source                                        9481 non-null   object 
 5   name                                          9481 non-null   object 
 6   description                                   9363 non-null   object 
 7   neighborhood_overview                         7139 non-null   object 
 8   picture_url                                   9481 non-null   object 
 9   host_id                                       9481 non-null   i

In [33]:
#Check for null values in the listing dataset
null_count = listing_df.isna().sum()

# select columns with null values greater than zero
null_cols = null_count.loc[null_count > 0]

# display the null values by column
print(null_cols)

description                      118
neighborhood_overview           2342
host_location                   1224
host_about                      3737
host_response_time              1671
host_response_rate              1671
host_acceptance_rate            1286
host_is_superhost                 19
host_neighbourhood              9453
neighbourhood                   2342
neighbourhood_group_cleansed    9481
bathrooms                       9481
bathrooms_text                    32
bedrooms                         576
beds                              57
minimum_minimum_nights             2
maximum_minimum_nights             2
minimum_maximum_nights             2
maximum_maximum_nights             2
minimum_nights_avg_ntm             2
maximum_nights_avg_ntm             2
calendar_updated                9481
first_review                    1396
last_review                     1396
review_scores_rating            1396
review_scores_accuracy          1514
review_scores_cleanliness       1513
r

In [17]:
#import review dataset 
url = "https://raw.githubusercontent.com/MercyMoraa/InsideAirbnb/main/review%20final1.csv"
review_df = pd.read_csv(url, encoding='ISO-8859-1')

print("Dataset shape:",review_df.shape)
review_df.head()

Dataset shape: (13649, 6)


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,3191,4888238.0,5/31/2013,5737473,Kathleen,Great home away from home! Bridgette and Marth...
1,3191,9128602.0,12/9/2013,8170322,Anita,Das Cottage liegt ruhig und sicher. Wir haben...
2,3191,9924130.0,1/20/2014,4039279,Zacki,This cottage was a great base from which to ex...
3,3191,16659537.0,7/31/2014,9729939,Doug,I had a great stay. All my needs were well ex...
4,3191,23247470.0,11/26/2014,9681619,Christopher,Excellent host. She provided everything we cou...


In [18]:
#Investigate review dataset
review_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13649 entries, 0 to 13648
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   listing_id     13649 non-null  int64  
 1   id             13649 non-null  float64
 2   date           13649 non-null  object 
 3   reviewer_id    13649 non-null  int64  
 4   reviewer_name  13649 non-null  object 
 5   comments       13649 non-null  object 
dtypes: float64(1), int64(2), object(3)
memory usage: 639.9+ KB
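
Because each review links a `reviewer_id` to a `listing_id`, the review data could serve as implicit feedback for collaborative filtering. The sketch below shows one possible way to build a guest-by-listing interaction matrix with `pd.crosstab`. The `reviews_sample` frame and its IDs are made up; only the two column names come from the dataset above.

```python
import pandas as pd

# Hypothetical reviews (IDs are made up; column names match review_df)
reviews_sample = pd.DataFrame({
    "listing_id": [101, 101, 102, 102, 103],
    "reviewer_id": [1, 2, 1, 3, 3],
})

# Rows = reviewers, columns = listings.
# crosstab counts reviews per pair; clip turns counts into 0/1 flags.
interaction = pd.crosstab(
    reviews_sample["reviewer_id"], reviews_sample["listing_id"]
).clip(upper=1)

print(interaction)
```

Treating a review as a positive interaction is an assumption: it ignores the sentiment of the `comments` column, which a fuller analysis might take into account.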


In [27]:
#import calendar dataset 
url = "https://raw.githubusercontent.com/MercyMoraa/InsideAirbnb/main/calendar%20final.csv"
cal_df = pd.read_csv(url)

print("Dataset shape:",cal_df.shape)
cal_df.head()

Dataset shape: (13649, 14)


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13
0,736534,12/29/2022,f,"$1,250.00","$1,250.00",1,1125,,,,,,,
1,736534,12/30/2022,f,"$1,250.00","$1,250.00",2,1125,,,,,,,
2,736534,12/31/2022,f,"$1,250.00","$1,250.00",2,1125,,,,,,,
3,736534,1/1/2023,f,"$1,250.00","$1,250.00",1,1125,,,,,,,
4,736534,1/2/2023,f,"$1,250.00","$1,250.00",1,1125,,,,,,,


In [28]:
#Investigate calendar dataset
cal_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13649 entries, 0 to 13648
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   listing_id      13649 non-null  int64  
 1   date            13649 non-null  object 
 2   available       13649 non-null  object 
 3   price           13649 non-null  object 
 4   adjusted_price  13649 non-null  object 
 5   minimum_nights  13649 non-null  int64  
 6   maximum_nights  13649 non-null  int64  
 7   Unnamed: 7      0 non-null      float64
 8   Unnamed: 8      0 non-null      float64
 9   Unnamed: 9      0 non-null      float64
 10  Unnamed: 10     0 non-null      float64
 11  Unnamed: 11     0 non-null      float64
 12  Unnamed: 12     0 non-null      float64
 13  Unnamed: 13     0 non-null      float64
dtypes: float64(7), int64(3), object(4)
memory usage: 1.5+ MB


In [29]:
#Check for null values on the calendar dataset
cal_df.isna().sum()

listing_id            0
date                  0
available             0
price                 0
adjusted_price        0
minimum_nights        0
maximum_nights        0
Unnamed: 7        13649
Unnamed: 8        13649
Unnamed: 9        13649
Unnamed: 10       13649
Unnamed: 11       13649
Unnamed: 12       13649
Unnamed: 13       13649
dtype: int64

In [31]:
#drop columns that contain no data (the empty 'Unnamed' columns)
cal_df = cal_df.dropna(axis=1, how='all')
print(cal_df.isna().sum())
cal_df.head()

listing_id        0
date              0
available         0
price             0
adjusted_price    0
minimum_nights    0
maximum_nights    0
dtype: int64


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,736534,12/29/2022,f,"$1,250.00","$1,250.00",1,1125
1,736534,12/30/2022,f,"$1,250.00","$1,250.00",2,1125
2,736534,12/31/2022,f,"$1,250.00","$1,250.00",2,1125
3,736534,1/1/2023,f,"$1,250.00","$1,250.00",1,1125
4,736534,1/2/2023,f,"$1,250.00","$1,250.00",1,1125
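
The `info()` output above shows `date`, `available`, `price`, and `adjusted_price` stored as `object` (string) columns, so they would need converting before any numeric or time-based analysis. Below is a minimal sketch of one way to do that. The `clean_calendar` helper and the `cal_sample` frame are hypothetical, and the sketch assumes prices always look like `$1,250.00` and dates like `12/29/2022`.

```python
import pandas as pd

# Small hypothetical frame mirroring the calendar columns (values are illustrative)
cal_sample = pd.DataFrame({
    "listing_id": [1, 1],
    "date": ["12/29/2022", "1/1/2023"],
    "available": ["f", "t"],
    "price": ["$1,250.00", "$980.50"],
    "adjusted_price": ["$1,250.00", "$980.50"],
})

def clean_calendar(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with parsed dates, numeric prices, and boolean availability."""
    out = df.copy()
    # Dates are month/day/year strings
    out["date"] = pd.to_datetime(out["date"], format="%m/%d/%Y")
    # Strip the currency symbol and thousands separator, then cast to float
    for col in ("price", "adjusted_price"):
        out[col] = out[col].str.replace(r"[$,]", "", regex=True).astype(float)
    # 't'/'f' flags become True/False
    out["available"] = out["available"].map({"t": True, "f": False})
    return out

cal_clean = clean_calendar(cal_sample)
print(cal_clean.dtypes)
```

Working on a copy keeps the raw frame intact, which makes it easier to re-run the cleaning step if the format assumptions turn out to be wrong for some rows.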


In [6]:
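#Check for null values in the summary listings dataset (`data` is loaded in an earlier cell not shown here)
data.isnull().sum()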

id                                    0
name                                  1
host_id                               0
host_name                             0
neighbourhood_group               19670
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                        5017
reviews_per_month                  5017
calculated_host_listings_count        0
availability_365                      0
number_of_reviews_ltm                 0
license                           19608
dtype: int64

In [7]:
#Count the unique values in each column
data.nunique()

id                                19670
name                              19292
host_id                           10686
host_name                          4401
neighbourhood_group                   0
neighbourhood                        90
latitude                          13713
longitude                         13303
room_type                             4
price                              4224
minimum_nights                       45
number_of_reviews                   308
last_review                        1429
reviews_per_month                   582
calculated_host_listings_count       52
availability_365                    366
number_of_reviews_ltm               107
license                              46
dtype: int64

# Data Cleansing