# Airbnb Web Scraping

The aim of this assignment is to create and save a dataset containing information about different listings on Airbnb. You will then use this dataset during the Artificial Intelligence I course to train a predictive model.

## Getting started

[Airbnb](https://www.airbnb.com/) allows people to rent out their properties on their online platform. Travelers can then book these properties for shorter or longer periods of time. The company was founded in August 2008 in San Francisco, California, and currently has an annual revenue stream of over 2.5 Billion US Dollars. In the US alone, the platform has 660,000 listings.

![airbnb](https://www.dropbox.com/s/njll910mmpzm86z/airbnb.png?raw=1)

Every individual listing contains a lot of information, such as the facilities offered, the location, information about the host and reviews. In this assignment you will build a web scraper to extract information from these listings using Python.

![bali](https://www.dropbox.com/s/5gnj4dsv1qmvji5/bali.png?raw=1)

The fact that we are confined at home shouldn't prevent us from dreaming we could go somewhere else. So, let's take a look at the different listings available to spend 5 nights in Bali during these Christmas holidays, from December 29 until January 3. You can check the different options available at the following [link](https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2020-12-29&checkout=2021-01-03&source=structured_search_input_header&search_type=autocomplete_click).

The whole url has been copied for you below.

In [1]:
url = "https://www.airbnb.com/s/Bali--Indonesia/homes?tab_id=home_tab&refinement_paths%5B%5D=%2Fhomes&query=Bali%2C%20Indonesia&place_id=ChIJoQ8Q6NNB0S0RkOYkS7EPkSQ&checkin=2020-12-29&checkout=2021-01-03&source=structured_search_input_header&search_type=autocomplete_click"

Let's begin by making a request to retrieve the HTML code for this website. Since it is an action that we need to perform several times throughout the assignment, let's encapsulate the corresponding code in a function.

<i>get_page</i>. This function takes a URL as input and returns its underlying HTML code as a <b>BeautifulSoup object</b>.

In [2]:
import requests
import bs4

def get_page(url):
  soup=bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
  return soup

The first step in trying to extract information from a webpage is to check how it is constructed. A brief look at the given webpage shows that the information on the different listings is shown underneath each other in a list form. 

<img src="https://www.dropbox.com/s/jl436b6cc1daent/bali_listing.png?raw=1" width="700">

For every listing, a preview image is shown together with some standard information, including a title, a subtitle, the number of guests allowed, the number of bedrooms and bathrooms, the number of beds, information about certain amenities, the price per night, the total price per stay, the average rating and the number of reviews.

<i>get_listings</i>. This function takes a BeautifulSoup object containing the code for a whole webpage as input and returns a <b>list</b> of the individual pieces of code for each listing, selecting the larger blocks of code that each contain a single listing.

In [4]:
def get_listings(soup):
  listings=soup.find_all("div", {"class":"_gig1e7"})
  return listings
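
To see what `find_all` returns without touching the network, the sketch below runs the same kind of lookup on a small hand-written HTML snippet. The markup and the class names `card`, `name` and `price` are made up for illustration; Airbnb's real class names (such as `_gig1e7`) look auto-generated and may well change over time.

```python
import bs4

# Hand-written stand-in for a results page (not real Airbnb markup).
html = """
<div class="card"><div class="name">Villa A</div></div>
<div class="card"><div class="name">Villa B</div></div>
<div class="other">Not a listing</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; an empty result means no match.
cards = soup.find_all("div", {"class": "card"})
names = [card.find("div", {"class": "name"}).text for card in cards]

# find returns None when nothing matches, so a None check is needed
# before reading .text on the result.
missing = cards[0].find("div", {"class": "price"})

print(len(cards), names, missing)
```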

## Retrieving the data

Let's retrieve separate information for each of them.

<i>get_listing_title</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>string</b> with its title as output. If no title is listed, the function returns <i>None</i>.

In [5]:
def get_listing_title(listing):
  if listing.find("div", {"class":"_bzh5lkq"}) is None:
    title=None
  else:
    title=listing.find("div", {"class":"_bzh5lkq"}).text
  return title

<i>get_listing_subtitle</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>string</b> with its subtitle as output. If no subtitle is listed, the function returns <i>None</i>.

In [6]:
def get_listing_subtitle(listing):
  if listing.find("div", {"class":"_167qordg"}) is None:
    subtitle=None
  else:
    subtitle=listing.find("div", {"class":"_167qordg"}).text
  return subtitle

Next comes the list of attributes that contains information about the number of guests allowed, the number of bedrooms, the number of beds and the number of bathrooms. A new function retrieves this information for each separate listing.

<i>get_listing_info</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>string</b> with the general listing information as output. If no information is provided, the function returns <i>None</i>.

In [7]:
def get_listing_info(listing):
  if listing.find("div", {"class":"_kqh46o"}) is None:
    info=None
  else:
    info=listing.find("div", {"class":"_kqh46o"}).text
  return info
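
The title, subtitle and info functions share one pattern: look up a tag by class and return its text, or None when the tag is absent. A hypothetical helper (the name `text_or_none` and its `index` parameter are not part of the assignment) could capture that pattern and avoid calling `find` twice. A minimal sketch on made-up markup:

```python
import bs4

def text_or_none(listing, tag, css_class, index=0):
  """Return the text of the index-th matching tag, or None if it is absent."""
  matches = listing.find_all(tag, {"class": css_class})
  if len(matches) <= index:
    return None
  return matches[index].text

# Made-up markup standing in for a single listing.
listing = bs4.BeautifulSoup(
  '<div class="t">Villa A</div><div class="i">2 guests</div><div class="i">Wifi</div>',
  "html.parser",
)

print(text_or_none(listing, "div", "t"))     # first (and only) "t" block
print(text_or_none(listing, "div", "i", 1))  # second "i" block
print(text_or_none(listing, "div", "x"))     # no such class
```

The `index` argument is there because several blocks in a listing can share the same class, in which case the position decides which one is wanted.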

Below that information list, there is yet another list that contains information about different amenities.

<i>get_listing_ammenities</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>string</b> with the listing amenities information as output. If no information is provided, the function returns <i>None</i>.

In [8]:
def get_listing_ammenities(listing):
  if len(listing.find_all("div", {"class":"_kqh46o"}))<2:
    ammenities=None
  else:
    ammenities=listing.find_all("div", {"class":"_kqh46o"})[1].text
  return ammenities

<i>get_listing_rating</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>float</b> with its average rating as output. If no rating is listed, the function returns <i>None</i>.

In [9]:
def get_listing_rating(listing):
  # Guard against listings that have no "_krjbj" span at all
  labels=listing.find_all("span", {"class":"_krjbj"})
  if labels and "Rating" in labels[0].text:
    rating=float(listing.find("span", {"class":"_10fy1f8"}).text)
  else:
    rating=None
  return rating

<i>get_listing_reviews</i>. This function takes a soup object containing the code for an individual listing as input and returns an <b>int</b> with its number of reviews as output. If no reviews are included, the function returns <i>None</i>.

In [10]:
def get_listing_reviews(listing):
  labels=listing.find_all("span", {"class":"_krjbj"})
  # "review" also matches "reviews", so one substring test covers both forms
  if len(labels)>1 and "review" in labels[1].text:
    int_review=[int(s) for s in labels[1].text.split() if s.isdigit()]
    reviews=int_review[0] if int_review else None
  else:
    reviews=None
  return reviews
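
When a condition has to accept either of two substrings, each one needs its own `in` test (or a single regular expression), because the expression `("reviews" or "review")` evaluates to just `"reviews"`. The sketch below uses a regular expression to pull the count out of a label. The helper name `parse_review_count` is hypothetical, and the sample strings are assumptions about what such a label might look like, not text copied from Airbnb.

```python
import re

def parse_review_count(text):
  """Return the number of reviews mentioned in text, or None if there is none."""
  match = re.search(r"(\d+)\s+reviews?\b", text)
  return int(match.group(1)) if match else None

# `or` between two non-empty strings returns the first one, so a
# condition built this way only ever checks for "reviews".
print(("reviews" or "review") == "reviews")

print(parse_review_count("82 reviews"))
print(parse_review_count("1 review"))
print(parse_review_count("New listing"))
```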

<i>get_listing_price_per_night</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>str</b> with its corresponding price per night. The string only contains the actual number, nothing else. If no price is listed, the function returns <i>None</i>.

In [11]:
def get_listing_price_per_night(listing):
    if "Disc" in listing.find("span", {"class":"_1p7iugi"}).text: #"Disc" for "Discounted",I split the string on the symbol $ and on the letter D, and retrieve the right element of it in the list
      price=listing.find("span", {"class":"_1p7iugi"}).text.split("$")[1].split("D")[0]
    else:
      if "rice" in listing.find("span", {"class":"_1p7iugi"}).text:
        price=listing.find("span", {"class":"_1p7iugi"}).text.split("$")[1]
      else:
        price=None
    return price



<i>get_listing_total_price</i>. This function takes a soup object containing the code for an individual listing as input and returns a <b>string</b> with its total price. This string only contains the actual number, nothing else. If no total price is listed, the function returns <i>None</i>.

In [13]:
def get_listing_total_price(listing):
  # Use the listing passed in, not an element of the global listings variable
  blocks=listing.find_all("div", {"class":"_17y0hv9"})
  if blocks and "total" in blocks[0].text:
    total_price=blocks[0].text.split("$")[1].split(" ")[0]
  else:
    total_price=None
  return total_price
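
Both price functions return the number as a string. If a numeric value is more convenient for modelling later on, the conversion could look like the sketch below. The helper name `parse_price` and the sample strings are assumptions for illustration; the real text inside the price elements may differ.

```python
import re

def parse_price(text):
  """Return the first dollar amount in text as a float, or None if absent."""
  match = re.search(r"\$\s*(\d[\d,]*(?:\.\d+)?)", text)
  if match is None:
    return None
  # Drop thousands separators before converting.
  return float(match.group(1).replace(",", ""))

print(parse_price("$1,084 total"))
print(parse_price("Price:$180 / night"))
print(parse_price("no price shown"))
```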

We need a way to extract the data from all the different pages. At the end of each page, there is a link to the next page, which we will use.

<i>find_next_page</i>. This function takes the URL of an individual page as input, downloads that page, and returns the <b>complete URL</b> for the next page. If there are no more pages left, it returns <i>None</i>.

In [14]:
base_url = "https://airbnb.com"

def find_next_page(page):
  soup=bs4.BeautifulSoup(requests.get(page).text, 'html.parser')
  if soup.find('a', {'class': '_za9j7e'}) is None:
    return None
  else:
    next_page=base_url+soup.find('a', {'class': '_za9j7e'})['href']
    return next_page
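
Concatenating `base_url` and the `href` works when the link is relative to the site root. `urllib.parse.urljoin` from the standard library is a slightly more forgiving alternative, since it also copes with links that are already absolute. A small sketch; the `href` values, including the `page=2` query parameter, are invented for illustration:

```python
from urllib.parse import urljoin

base_url = "https://airbnb.com"

# A root-relative link is resolved against the base URL.
relative = urljoin(base_url, "/s/Bali--Indonesia/homes?page=2")

# An absolute link is returned unchanged instead of being doubled up.
absolute = urljoin(base_url, "https://airbnb.com/s/Bali--Indonesia/homes?page=2")

print(relative)
print(absolute)
```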

The code below retrieves the data above for all the listings on all the different pages. The information is stored in lists called <i>title</i>, <i>subtitle</i>, <i>info</i>, <i>ammenities</i>, <i>rating</i>, <i>reviews</i>, <i>price_per_night</i> and <i>total_price</i>.

In [15]:
title = []
subtitle = []
info = []
ammenities = []
rating = []
reviews = []
price_per_night = []
total_price = []

webpage=url
n_pages = int(get_page(webpage).find("div",{"class":"_jro6t0"}).text[-2:]) # only works while the number of pages is below 100
# Loop over the pages, moving on to the next-page link after each one
for run in range(n_pages):
  soup=get_page(webpage)
  listings=get_listings(soup)
  webpage=find_next_page(webpage)
  for listing in listings:
    title.append(get_listing_title(listing))
    info.append(get_listing_info(listing))
    rating.append(get_listing_rating(listing))
    reviews.append(get_listing_reviews(listing))
    price_per_night.append(get_listing_price_per_night(listing))
    total_price.append(get_listing_total_price(listing))
    subtitle.append(get_listing_subtitle(listing))
    ammenities.append(get_listing_ammenities(listing))
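
An alternative to reading the page count up front is to keep following the next-page link until `find_next_page` returns None. The sketch below shows that control flow in isolation: `fake_next` is a stand-in for `find_next_page` so the loop can run without network access, and `crawl`, `max_pages` and `delay` are hypothetical names introduced here, not part of the assignment.

```python
import time

def crawl(start_url, find_next, max_pages=100, delay=0.0):
  """Return the URLs visited by following next-page links from start_url."""
  visited = []
  webpage = start_url
  while webpage is not None and len(visited) < max_pages:
    visited.append(webpage)      # the per-page scraping would happen here
    webpage = find_next(webpage)
    time.sleep(delay)            # optional pause between requests
  return visited

# Stand-in for find_next_page: three pages linked in sequence.
links = {"page1": "page2", "page2": "page3", "page3": None}

def fake_next(url):
  return links[url]

print(crawl("page1", fake_next))
```

The `max_pages` cap is only a safety net against an endless chain of links. The loop normally stops when no next link is found.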

## Saving the data

Now that all the data has been retrieved, let's store it in a DataFrame.

In [16]:
import pandas as pd

In [17]:
airbnb = pd.DataFrame(data = {'title': title,
                          'subtitle': subtitle,
                          'info': info,
                          'ammenities': ammenities,
                          'rating': rating,
                          'reviews': reviews,
                          'price_per_night': price_per_night,
                          'total_price':total_price,
                         })

In [18]:
airbnb

| | title | subtitle | info | ammenities | rating | reviews | price_per_night | total_price |
|---|---|---|---|---|---|---|---|---|
| 0 | LOWER PRICE - Special Offer For Monthly Rental ! | Entire villa in Ketewel | 8 guests · 4 bedrooms · 5 beds · 4.5 baths | Pool · Wifi · Air conditioning · Kitchen | | | 354 | 2020 |
| 1 | gmb beachhouse bingin beachfront amazing villa | Island in Kabupaten Badung | 4 guests · 3 bedrooms · 4 beds · 1 bath | Wifi · Air conditioning · Kitchen | 4.77 | 82.0 | 180 | 1084 |
| 2 | Beautiful villa on the edge of BLUE LAGOON | Entire villa in Nusa Ceningan | 2 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi · Air conditioning | 4.82 | 310.0 | 92 | 400 |
| 3 | PROMO -70%- Amazing 4BR Villa With Ricefield View | Entire villa in Kecamatan Ubud | 10 guests · 4 bedrooms · 6 beds · 4 baths | Pool · Wifi · Air conditioning · Kitchen | 4.94 | 16.0 | 459 | 1969 |
| 4 | Villa Murai Sumberkima Hill | Entire villa in Pemuteran | 2 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi · Air conditioning · Kitchen | 4.85 | 20.0 | 141 | 803 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | Honeymooners Villa Private Pool w/ beach access | Entire villa in Kecamatan Tabanan | 3 guests · 1 bedroom · 1 bed · 1 bath | Pool · Wifi · Air conditioning | 5.00 | 17.0 | 166 | 947 |
| 296 | Exotic Studio Room in Kuta (202) | Entire apartment in Kuta | 2 guests · 1 bedroom · 1 bed · 1 bath | Wifi · Air conditioning · Kitchen | 4.64 | 14.0 | 17 | 61 |
| 297 | Stunning Boho Designer Villa Canggu Ricefieldview | Entire villa in Kuta Utara | 6 guests · 3 bedrooms · 3 beds · 3.5 baths | Pool · Wifi · Air conditioning · Kitchen | 4.91 | 35.0 | 350 | 2026 |
| 298 | Oka's Homestay stndr #2 | Private room in Kuta Utara | 2 guests · 1 bedroom · 1 bed · 1 private bath | Wifi · Air conditioning | 4.00 | 8.0 | 14 | 57 |


Save the DataFrame to a CSV file by running the following cell.

In [19]:
airbnb.to_csv("airbnb.csv", index=False)
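
The `index=False` argument matters when the file is read back later. Without it, pandas writes the row index as an extra column with an empty header, which shows up as `Unnamed: 0` after `read_csv`. A minimal sketch with a two-row toy DataFrame (made-up values), written to an in-memory buffer instead of a file:

```python
import io
import pandas as pd

toy = pd.DataFrame({"title": ["Villa A", "Villa B"], "rating": [4.8, None]})

# Default: the index is written as a first column with an empty header.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

# index=False: only the data columns are written.
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```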

## Bonus exercises

Let's redefine the way in which both the info and the amenities data are stored. For each of these items, you retrieved the whole string containing information about different elements. Let's extract the separate information for each of these elements.

<div class="alert alert-danger"><b>Bonus 1 </b>Write the code to retrieve the individual data from the <i>info</i> list. This list should contain information about the number of guests, the number of bedrooms, the number of beds and the number of baths. Store the information for the number of guests, bedrooms and bathrooms in separate lists called <i>guests</i>, <i>bedrooms</i> and <i>baths</i>. The number of guests and bedrooms should be stored in <b>int</b> form, while the number of baths should be a <b>float</b>.</div>

In [20]:
import re

# Helper functions for the three elements
def get_guests(listing):
  info=get_listing_info(listing)
  if info is not None and "guest" in info:
    number_guests=int(info.split("guest")[0])
  else:
    number_guests=None
  return number_guests

def get_bathrooms(listing):
  info=get_listing_info(listing)
  if info is None:
    number_bathrooms=None
  elif "Half-bath" in info:
    number_bathrooms=0.5
  elif " bath" in info:
    # The bath count is the last field; it may be a decimal such as 4.5
    temp=info.split("·")[-1]
    number_bathrooms=float(re.findall(r"[-+]?\d*\.\d+|\d+", temp)[0])
  else:
    number_bathrooms=None
  return number_bathrooms

def get_bedrooms(listing):
  info=get_listing_info(listing)
  if info is None or "bedroom" not in info:
    return None
  # Bedrooms are the second field when the guest count comes first, otherwise the first
  temp=info.split("·")[1 if "guest" in info else 0]
  return [int(s) for s in temp.split() if s.isdigit()][0]
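
A regular expression per field is another way to split the info string, and it does not depend on the position of each field. This is a sketch rather than the reference solution: the function name `parse_info` is hypothetical, and the patterns assume strings shaped like the ones in the first results table (for instance `8 guests · 4 bedrooms · 5 beds · 4.5 baths`).

```python
import re

def parse_info(info):
  """Return (guests, bedrooms, baths) parsed from an info string."""
  if info is None:
    return None, None, None
  guests = re.search(r"(\d+)\s+guests?", info)
  bedrooms = re.search(r"(\d+)\s+bedrooms?", info)
  # Allows one word between the number and "bath", e.g. "1 private bath".
  baths = re.search(r"(\d+(?:\.\d+)?)\s+(?:\w+\s+)?baths?", info)
  if "Half-bath" in info:
    baths_value = 0.5
  else:
    baths_value = float(baths.group(1)) if baths else None
  return (
    int(guests.group(1)) if guests else None,
    int(bedrooms.group(1)) if bedrooms else None,
    baths_value,
  )

print(parse_info("8 guests · 4 bedrooms · 5 beds · 4.5 baths"))
print(parse_info("2 guests · 1 bedroom · 1 bed · 1 private bath"))
print(parse_info(None))
```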

In [21]:
guests = []
bedrooms = []
baths = []

webpage=url
n_pages = int(get_page(webpage).find("div",{"class":"_jro6t0"}).text[-2:]) #it works only as long as the number of page is below 100
for run in range(n_pages):
  soup=get_page(webpage)
  listings=get_listings(soup)
  webpage=find_next_page(webpage)
  for listing in listings:
    guests.append(get_guests(listing))
    baths.append(get_bathrooms(listing))
    bedrooms.append(get_bedrooms(listing))

Apart from the general information, when searching for the best choice, there might be some specific things we are looking for. I don't know about you, but I would definitely look for a place with a pool, a kitchen, wifi and definitely some air conditioning.

<div class="alert alert-danger"><b>Bonus 2 </b>Write the code to retrieve these data from the <i>ammenities</i> list. Remember that this list contains information about the different services that are offered in str form. For each of the 4 amenities above (wifi, kitchen, air conditioning and pool) create a new list that stores a 1 in <b>int</b> form if the amenity is present and a 0 in <b>int</b> form otherwise. Store these values in separate lists called <i>wifi</i>, <i>kitchen</i>, <i>air_conditioning</i> and <i>pool</i>. If the value in your <i>ammenities</i> list is missing, use <i>None</i> for every individual amenity.</div>

In [22]:
def has_ammenity(listing, keyword):
  # 1 if the keyword appears in the amenities string, 0 if not, None if the string is missing
  ammenities=get_listing_ammenities(listing)
  if ammenities is None:
    return None
  return int(keyword in ammenities)

def get_wifi(listing):
  return has_ammenity(listing, "Wifi")

def get_kitchen(listing):
  return has_ammenity(listing, "Kitchen")

def get_air_conditioning(listing):
  return has_ammenity(listing, "conditioning")

def get_pool(listing):
  return has_ammenity(listing, "Pool")
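
A substring check for a word like `Pool` also matches longer amenity names that merely contain it. If exact matches are preferred, the amenities string can be split on the `·` separator first. A sketch under that assumption; the function name `amenity_flags` and the example string `Pool table` are illustrative, not taken from the data:

```python
def amenity_flags(ammenities, wanted=("Wifi", "Kitchen", "Air conditioning", "Pool")):
  """Map each wanted amenity to 1/0, or to None when the string is missing."""
  if ammenities is None:
    return {name: None for name in wanted}
  # Split on the separator and strip whitespace to get exact amenity names.
  present = {item.strip() for item in ammenities.split("·")}
  return {name: int(name in present) for name in wanted}

print(amenity_flags("Pool · Wifi · Air conditioning"))
print(amenity_flags("Pool table · Wifi"))
print(amenity_flags(None))
```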

In [23]:
wifi = []
kitchen = []
air_conditioning = []
pool = []

webpage=url
n_pages = int(get_page(webpage).find("div",{"class":"_jro6t0"}).text[-2:]) #it works only as long as the number of page is below 100
for run in range(n_pages):
  soup=get_page(webpage)
  listings=get_listings(soup)
  webpage=find_next_page(webpage)
  for listing in listings:
    wifi.append(get_wifi(listing))
    kitchen.append(get_kitchen(listing))
    air_conditioning.append(get_air_conditioning(listing))
    pool.append(get_pool(listing))

<div class="alert alert-danger"><b>Bonus 3 </b>Write the code to save these data to a DataFrame object called <i>airbnb</i>. This time, instead of including the columns <i>info</i> and <i>ammenities</i>, include the lists you created above. The names of the different columns should be equal to those of the lists you just created: <i>title</i>, <i>subtitle</i>, <i>rating</i>, <i>reviews</i>, <i>price_per_night</i>, <i>total_price</i>, <i>guests</i>, <i>bedrooms</i>, <i>baths</i>, <i>wifi</i>, <i>kitchen</i>, <i>air_conditioning</i> and <i>pool</i>. Don't define any index when defining your DataFrame.</div>

In [25]:
import pandas as pd
airbnb=pd.DataFrame(data = {'title': title,
                          'subtitle': subtitle,
                          'rating': rating,
                          'reviews': reviews,
                          'price_per_night': price_per_night,
                          'total_price': total_price,
                          'guests': guests,
                          'bedrooms':bedrooms,
                          'baths':baths,
                          'wifi':wifi,
                          'kitchen':kitchen,
                          'air_conditioning':air_conditioning,
                          'pool':pool,
                         })

In [26]:
airbnb

| | title | subtitle | rating | reviews | price_per_night | total_price | guests | bedrooms | baths | wifi | kitchen | air_conditioning | pool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LOWER PRICE - Special Offer For Monthly Rental ! | Entire villa in Ketewel | | | 354 | 2020 | 8 | 4.0 | 4.5 | 1.0 | 0.0 | 1.0 | 1.0 |
| 1 | gmb beachhouse bingin beachfront amazing villa | Island in Kabupaten Badung | 4.77 | 82.0 | 180 | 1084 | 4 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | Beautiful villa on the edge of BLUE LAGOON | Entire villa in Nusa Ceningan | 4.82 | 310.0 | 92 | 400 | 2 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 3 | PROMO -70%- Amazing 4BR Villa With Ricefield View | Entire villa in Kecamatan Ubud | 4.94 | 16.0 | 459 | 1969 | 2 | 1.0 | 1.5 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | Villa Murai Sumberkima Hill | Entire villa in Pemuteran | 4.85 | 20.0 | 141 | 803 | 2 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | Honeymooners Villa Private Pool w/ beach access | Entire villa in Kecamatan Tabanan | 5.00 | 17.0 | 166 | 947 | 2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 296 | Exotic Studio Room in Kuta (202) | Entire apartment in Kuta | 4.64 | 14.0 | 17 | 61 | 2 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 |
| 297 | Stunning Boho Designer Villa Canggu Ricefieldview | Entire villa in Kuta Utara | 4.91 | 35.0 | 350 | 2026 | 2 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 298 | Oka's Homestay stndr #2 | Private room in Kuta Utara | 4.00 | 8.0 | 14 | 57 | 2 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |


Store the data above, together with the new information, in a CSV file.

In [27]:
airbnb.to_csv("airbnb.csv", index=False)  # index=False keeps the row index out of the file