# 1- Which aspects of a listing correlate best with price?

# 2- Which aspects of a listing correlate best with availability (lack of bookings), and, if such aspects are found, do they necessarily correlate with fully booked listings?

# 3- Which aspects of a listing correlate best with a positive review, or a negative one?


In [33]:
import pandas as pd
import numpy as np
import seaborn as sns
from bs4 import BeautifulSoup
import urllib


In [34]:
Boston_listings = pd.read_csv('Boston_listings.csv')

In [35]:
Boston_listings

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3168,https://www.airbnb.com/rooms/3168,20220915162158,2022-09-15,city scrape,TudorStudio,"The ""Studio at 14 Weldon"" is located in Newton...","Newton has 13 unique villages, and gives off a...",https://a0.muscache.com/pictures/ff7952dc-ef0b...,3697,...,,,,,f,1,0,1,0,
1,3781,https://www.airbnb.com/rooms/3781,20220915162158,2022-09-15,city scrape,HARBORSIDE-Walk to subway,Fully separate apartment in a two apartment bu...,"Mostly quiet ( no loud music, no crowed sidewa...",https://a0.muscache.com/pictures/24670/b2de044...,4804,...,4.96,4.87,4.91,,f,1,1,0,0,0.26
2,5506,https://www.airbnb.com/rooms/5506,20220915162158,2022-09-15,city scrape,** Fort Hill Inn Private! Minutes to center!**,"Private guest room with private bath, You do n...","Peaceful, Architecturally interesting, histori...",https://a0.muscache.com/pictures/miso/Hosting-...,8229,...,4.89,4.54,4.73,Approved by the government,f,10,10,0,0,0.69
3,6695,https://www.airbnb.com/rooms/6695,20220915162158,2022-09-15,city scrape,"Fort Hill Inn *Sunny* 1 bedroom, condo duplex","Comfortable, Fully Equipped private apartment...","Peaceful, Architecturally interesting, histori...",https://a0.muscache.com/pictures/38ac4797-e7a4...,8229,...,4.95,4.50,4.71,STR446650,f,10,10,0,0,0.75
4,7903,https://www.airbnb.com/rooms/7903,20220915162158,2022-09-15,city scrape,"Colorful, modern 2 BR apt shared with host",I'm a high school teacher and frequent travele...,"The apartment is in Somerville, located direct...",https://a0.muscache.com/pictures/miso/Hosting-...,14169,...,4.95,4.56,4.80,,f,1,0,1,0,1.84
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5180,716081443145047239,https://www.airbnb.com/rooms/716081443145047239,20220915162158,2022-09-15,city scrape,Private Room with Shared Bath in Quiet Street,*Please Note: You are booking a private room i...,South Boston is a very large neighborhood comp...,https://a0.muscache.com/pictures/prohost-api/H...,2356643,...,,,,STR-460218,f,71,25,46,0,
5181,716081469166085329,https://www.airbnb.com/rooms/716081469166085329,20220915162158,2022-09-15,city scrape,Cozy Bedroom in Convenient Downtown Location,*Please Note: You are booking a private room i...,South Boston is a very large neighborhood comp...,https://a0.muscache.com/pictures/prohost-api/H...,2356643,...,,,,STR-460218,f,71,25,46,0,
5182,716081495310456299,https://www.airbnb.com/rooms/716081495310456299,20220915162158,2022-09-15,city scrape,"Peaceful Bedroom w/ Shared Bath - AC, Wifi inc...",*Please Note: You are booking a private room i...,South Boston is a very large neighborhood comp...,https://a0.muscache.com/pictures/prohost-api/H...,2356643,...,,,,STR-460218,f,71,25,46,0,
5183,716235197792512391,https://www.airbnb.com/rooms/716235197792512391,20220915162158,2022-09-15,city scrape,Sunny Room w/ Shared Bath in Modest Brighton Home,"Perfect for Hospital Stays, Medical Students, ...",The apartment is located in a walkable neighbo...,https://a0.muscache.com/pictures/prohost-api/H...,2356643,...,,,,STR-484106,t,71,25,46,0,


### After a quick data exploration, we need to drop several groups of columns:
> 1- Columns that are mostly or entirely null, except those whose values can be recovered by scraping, as we will see later in the notebook <br>
2- Personal info (Id) <br>
3- Non-relevant info (URLs; `listing_url` is kept for now because it is needed to validate some of the data, as we will see later in the notebook) <br>
4- Metadata (source) <br>
5- String columns (i.e. those that would need TF-IDF), for simplicity
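
The first criterion can be checked numerically. The following is a minimal sketch, not the procedure used in this notebook: the helper names `null_fractions` and `mostly_null_columns`, the 0.5 threshold, and the toy values are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def null_fractions(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, highest first."""
    return df.isnull().mean().sort_values(ascending=False)


def mostly_null_columns(df: pd.DataFrame, threshold: float = 0.5) -> list:
    """List columns whose missing fraction is at or above `threshold`."""
    fractions = df.isnull().mean()
    return fractions[fractions >= threshold].index.tolist()


# Toy frame standing in for Boston_listings (values are made up).
toy = pd.DataFrame({
    "price": [100.0, 80.0, 120.0, 95.0],
    "license": [np.nan, np.nan, "STR-1", np.nan],
    "bathrooms": [np.nan, np.nan, np.nan, np.nan],
})

# Columns that are candidates for dropping under the assumed threshold.
candidates = mostly_null_columns(toy, threshold=0.5)
```

A column flagged this way is only a candidate: as noted above, a mostly-null column may still be worth keeping if its values can be scraped.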

In [11]:
Boston_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5185 entries, 0 to 5184
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            5185 non-null   int64  
 1   listing_url                                   5185 non-null   object 
 2   scrape_id                                     5185 non-null   int64  
 3   last_scraped                                  5185 non-null   object 
 4   source                                        5185 non-null   object 
 5   name                                          5185 non-null   object 
 6   description                                   5140 non-null   object 
 7   neighborhood_overview                         3435 non-null   object 
 8   picture_url                                   5185 non-null   object 
 9   host_id                                       5185 non-null   i

In [12]:
Boston_listings.drop(columns=['id', 'name', 'description', 'last_scraped', 'scrape_id',
                                'host_name', 'host_about', 'host_neighbourhood', 'amenities',
                                'source','picture_url', 'first_review', 'last_review', 'review_scores_rating',
                                'review_scores_accuracy', 'review_scores_cleanliness','review_scores_checkin',
                                'review_scores_communication','review_scores_location', 'review_scores_value',
                                'license', 'host_id', 'host_url','host_thumbnail_url',
                                'host_picture_url','calendar_updated','bathrooms','neighbourhood_group_cleansed',
                             'neighbourhood', 'neighborhood_overview'],
                    inplace=True)


In [13]:
Boston_listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5185 entries, 0 to 5184
Data columns (total 45 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   listing_url                                   5185 non-null   object 
 1   host_since                                    5185 non-null   object 
 2   host_location                                 4485 non-null   object 
 3   host_response_time                            4575 non-null   object 
 4   host_response_rate                            4575 non-null   object 
 5   host_acceptance_rate                          4635 non-null   object 
 6   host_is_superhost                             5182 non-null   object 
 7   host_listings_count                           5185 non-null   int64  
 8   host_total_listings_count                     5185 non-null   int64  
 9   host_verifications                            5185 non-null   o

> The *neighbourhood* column carries little information: 2255 instances are **Boston, Massachusetts, United States**, which does not identify the exact area of a listing. Unfortunately, *neighborhood_overview* depends on *neighbourhood* and also has missing values, so dropping both columns seems fine. <br><br> An alternative to the *neighbourhood* column is *neighbourhood_cleansed*.
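
One way to judge how informative such a column is would be to look at the share of its most common value. This is a small sketch on made-up values, not the notebook's data; the Series below is a hypothetical stand-in for the raw *neighbourhood* column.

```python
import pandas as pd

# Toy stand-in for the raw `neighbourhood` column (values are made up).
neighbourhood = pd.Series([
    "Boston, Massachusetts, United States",
    "Boston, Massachusetts, United States",
    "Jamaica Plain, Massachusetts, United States",
    None,
    "Boston, Massachusetts, United States",
])

# Share of each value, with missing entries counted as their own category.
shares = neighbourhood.value_counts(normalize=True, dropna=False)

# The first entry is the most frequent value and its share of all rows.
top_value = shares.index[0]
top_share = shares.iloc[0]
```

When a single generic value plus the missing entries account for most rows, the column says little about where a listing actually is.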

In [7]:

one_hot = pd.get_dummies(Boston_listings['neighbourhood_cleansed'])

Boston_listings = Boston_listings.drop('neighbourhood_cleansed',axis = 1)

Boston_listings = Boston_listings.join(one_hot)

# =================================================================

one_hot = pd.get_dummies(Boston_listings['host_location'], dummy_na=True, prefix='host_location')

Boston_listings = Boston_listings.drop('host_location',axis = 1)

Boston_listings = Boston_listings.join(one_hot)


# =================================================================


one_hot = pd.get_dummies(Boston_listings['bathrooms_text'], dummy_na=True, prefix='bathrooms_text')

Boston_listings = Boston_listings.drop('bathrooms_text',axis = 1)

Boston_listings = Boston_listings.join(one_hot)



# =================================================================


Boston_listings['host_response_rate'] = Boston_listings['host_response_rate'].str.rstrip('%').astype('float')


# =================================================================


Boston_listings['host_acceptance_rate'] = Boston_listings['host_acceptance_rate'].str.rstrip('%').astype('float')


# =================================================================



Boston_listings['price'] = Boston_listings['price'].str.strip('$').str.replace(',', '').astype('float')
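
The string-cleaning steps at the end of the cell above could be factored into two small helpers. This is a sketch only: the function names `percent_to_float` and `currency_to_float` are hypothetical, and the sample strings are invented.

```python
import pandas as pd


def percent_to_float(series: pd.Series) -> pd.Series:
    """Convert strings such as '95%' to 95.0; missing values stay NaN."""
    return series.str.rstrip('%').astype('float')


def currency_to_float(series: pd.Series) -> pd.Series:
    """Convert strings such as '$1,250.00' to 1250.0."""
    return (series.str.lstrip('$')
                  .str.replace(',', '', regex=False)
                  .astype('float'))


# Made-up sample values in the same shape as the raw columns.
rates = percent_to_float(pd.Series(['100%', None, '88%']))
prices = currency_to_float(pd.Series(['$1,250.00', '$99.00']))
```

Passing `regex=False` makes the comma replacement a literal substitution, which avoids depending on the default regex behaviour of `str.replace` in a given pandas version.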


> The *listing URLs* that correspond to null values in the `bedrooms` column suggest that those nulls mean *zero*, i.e. *no actual bedroom*.

In [8]:
Boston_listings[Boston_listings['bedrooms'].isnull()]['listing_url']

0                     https://www.airbnb.com/rooms/3168
3                     https://www.airbnb.com/rooms/6695
7                    https://www.airbnb.com/rooms/10813
8                    https://www.airbnb.com/rooms/10986
34                  https://www.airbnb.com/rooms/210097
                             ...                       
5137    https://www.airbnb.com/rooms/708066864505175780
5158    https://www.airbnb.com/rooms/711804721312473870
5159    https://www.airbnb.com/rooms/712092718787212242
5174    https://www.airbnb.com/rooms/714906239224334877
5176    https://www.airbnb.com/rooms/715658190467254169
Name: listing_url, Length: 559, dtype: object

In [9]:
Boston_listings['bedrooms'].isnull().mean()

0.10781099324975892

In [15]:
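# The mode is computed for reference only; nulls are filled with 0 because
# the listing pages suggest these listings have no separate bedroom.
bedrooms_mode = Boston_listings['bedrooms'].mode()

Boston_listings = Boston_listings.fillna({'bedrooms':0})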

In [11]:
Boston_listings['bedrooms'].isnull().mean()

0.0

> For host_response_rate we can fill some of the null values: <br><br>
1- By web scraping <br>
2- By using the host_is_superhost feature, because a host cannot be a superhost without satisfying these three requirements: <br>
>>1- Completed at least 10 trips or 3 reservations that total at least 100 nights. <br>
2- Maintained a 90% response rate or higher. <br>
3- Maintained a less than 1% cancellation rate, with exceptions made for those that fall under our Extenuating Circumstances policy. <br>
Thus, where host_response_rate is null and host_is_superhost is 't', we can substitute 90%, the lowest rate a superhost can have.
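
The superhost rule can be illustrated on a toy frame. This is a sketch with made-up values; only the two column names come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) with the two columns the rule needs.
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", "t", "f"],
    "host_response_rate": [np.nan, np.nan, 97.0, 80.0],
})

# Rows where the rate is missing but the host is a superhost.
mask = toy["host_response_rate"].isnull() & (toy["host_is_superhost"] == "t")

# 90 is the minimum rate a superhost must maintain, so it is a lower bound,
# not the host's true rate.
toy.loc[mask, "host_response_rate"] = 90.0
```

Only the first row changes here: the second row stays null because the host is not a superhost, and the rows that already have a rate are left untouched.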

In [12]:
#<span class="ll4r2nl dir dir-ltr">100%</span>
from selenium import webdriver
import time
from bs4 import BeautifulSoup
import re
import requests
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service  # needed for Service(...) below
from webdriver_manager.chrome import ChromeDriverManager


options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')  # Last I checked this was necessary.


for iteration, (ind, row) in enumerate(Boston_listings[Boston_listings['host_response_rate'].isnull()].iterrows()):
    
    print('iteration: ', iteration)
    
    url = row['listing_url']
    response = requests.get(url)
    if response.status_code != 200:
        print("URL not valid")
        continue
    
    ser = Service(r"/Users/abdulrahmanalmutlaq/Downloads/chromedriver_mac_arm64/chromedriver")
    driver = webdriver.Chrome(service=ser, options=options)    
    driver.get(url)

    time.sleep(40)
    
    content = driver.page_source.encode('utf-8').strip()
    driver.quit() 
    soup = BeautifulSoup(content,"html.parser")
    
        
        
    officials = soup.findAll("div",{"class":"_1k8vduze"}) 
    res_rate = re.findall(r'(?:\d+%)|(?:\d+\.\d+%)', str(officials))
    if len(res_rate) == 0 :
        print("Fail\t", url)
        continue
        
    res_time =  str(officials).split('ul>')[0].split('li>')[-2].split('span>')[0].split('>')[2].split('<')[0].strip()

    Boston_listings.at[ind, 'host_response_rate'] = float(res_rate[0][:-1])
    Boston_listings.at[ind, 'host_response_time'] = res_time
    print(iteration, "\t", ind, "\t", url)
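
The percentage extraction in the loop above can be tried in isolation, without a browser. The sketch below is an assumption-laden simplification: `first_percentage` is a hypothetical helper, the single pattern with an optional decimal part is a swapped-in variant of the regex used in the cell, and the sample markup is the one noted in the comment at the top of the cell.

```python
import re

# Matches "100%" as well as "99.5%"; the decimal part is optional.
PERCENT_RE = re.compile(r'\d+(?:\.\d+)?%')


def first_percentage(html: str):
    """Return the first percentage found in `html` as a float, or None."""
    match = PERCENT_RE.search(html)
    if match is None:
        return None
    return float(match.group(0).rstrip('%'))


# The snippet mirrors the markup noted in the comment at the top of the cell.
sample = '<span class="ll4r2nl dir dir-ltr">100%</span>'
rate = first_percentage(sample)
```

Returning `None` when nothing matches corresponds to the "Fail" branch of the loop, where the listing is skipped.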



In [16]:
Boston_listings.loc[(Boston_listings['host_response_rate'].isnull()) & (Boston_listings['host_is_superhost']=='t'), 'host_response_rate'] = 90



In [17]:


one_hot = pd.get_dummies(Boston_listings['host_response_time'], dummy_na=True, prefix='host_response_time')

Boston_listings = Boston_listings.drop('host_response_time',axis = 1)

Boston_listings = Boston_listings.join(one_hot)

In [18]:
Boston_listings[Boston_listings['host_response_rate'].isnull()]

Unnamed: 0,listing_url,host_since,host_location,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,...,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,host_response_time_a few days or more,host_response_time_within a day,host_response_time_within a few hours,host_response_time_within an hour,host_response_time_nan
0,https://www.airbnb.com/rooms/3168,2008-10-17,"Boston, MA",,0%,f,1,1,"['email', 'phone']",t,...,1,0,1,0,,0,0,0,0,1
25,https://www.airbnb.com/rooms/154505,2011-06-25,"Somerville, MA",,,f,7,11,"['email', 'phone']",t,...,7,0,0,7,0.98,0,0,0,0,1
30,https://www.airbnb.com/rooms/190170,2011-07-19,"Boston, MA",,88%,f,1,2,"['email', 'phone']",t,...,1,1,0,0,0.61,0,0,0,0,1
35,https://www.airbnb.com/rooms/219956,2011-06-25,"Somerville, MA",,,f,7,11,"['email', 'phone']",t,...,7,0,0,7,0.71,0,0,0,0,1
37,https://www.airbnb.com/rooms/222081,2011-06-25,"Somerville, MA",,,f,7,11,"['email', 'phone']",t,...,7,0,0,7,0.76,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5037,https://www.airbnb.com/rooms/700676743427427933,2014-01-01,"Boston, MA",,,f,1,1,"['email', 'phone']",t,...,1,1,0,0,,0,0,0,0,1
5063,https://www.airbnb.com/rooms/705105471492230748,2012-07-17,"Boston, MA",,100%,f,2,5,"['email', 'phone']",t,...,2,1,1,0,,0,0,0,0,1
5160,https://www.airbnb.com/rooms/712177073029740980,2015-10-26,"Massachusetts, United States",,,f,1,1,"['email', 'phone']",t,...,1,0,1,0,,0,0,0,0,1
5163,https://www.airbnb.com/rooms/712391877716084035,2016-05-29,"Boston, MA",,,f,1,2,['phone'],t,...,1,1,0,0,,0,0,0,0,1


> Based on the `listing_url` column, listings with null `reviews_per_month` have no reviews yet (or only a handful), so substituting 0 for the null values is the best option.

In [16]:
Boston_listings[Boston_listings['reviews_per_month'].isnull()]

Unnamed: 0,listing_url,host_since,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,...,bathrooms_text_8 baths,bathrooms_text_Half-bath,bathrooms_text_Private half-bath,bathrooms_text_Shared half-bath,bathrooms_text_nan,host_response_time_a few days or more,host_response_time_within a day,host_response_time_within a few hours,host_response_time_within an hour,host_response_time_nan
0,https://www.airbnb.com/rooms/3168,2008-10-17,,0.0,f,1,1,"['email', 'phone']",t,f,...,0,0,0,0,0,0,0,0,0,1
17,https://www.airbnb.com/rooms/55310,2010-07-12,100.0,96.0,t,4,7,"['email', 'phone', 'work_email']",t,t,...,0,0,0,0,0,0,0,1,0,0
29,https://www.airbnb.com/rooms/184893,2011-07-29,90.0,25.0,f,11,22,"['email', 'phone']",t,t,...,0,0,0,0,0,0,0,0,1,0
96,https://www.airbnb.com/rooms/1077105,2013-04-14,,,f,1,1,"['email', 'phone']",t,f,...,0,0,0,0,0,0,0,0,0,1
208,https://www.airbnb.com/rooms/2864688,2011-07-29,90.0,25.0,f,11,22,"['email', 'phone']",t,t,...,0,0,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5180,https://www.airbnb.com/rooms/716081443145047239,2012-05-12,100.0,99.0,f,73,272,"['email', 'phone']",t,t,...,0,0,0,0,0,0,0,0,1,0
5181,https://www.airbnb.com/rooms/716081469166085329,2012-05-12,100.0,99.0,f,73,272,"['email', 'phone']",t,t,...,0,0,0,0,0,0,0,0,1,0
5182,https://www.airbnb.com/rooms/716081495310456299,2012-05-12,100.0,99.0,f,73,272,"['email', 'phone']",t,t,...,0,0,0,0,0,0,0,0,1,0
5183,https://www.airbnb.com/rooms/716235197792512391,2012-05-12,100.0,99.0,f,73,272,"['email', 'phone']",t,t,...,0,0,0,0,0,0,0,0,1,0


In [19]:
Boston_listings = Boston_listings.fillna({'reviews_per_month':0.0})

> For host_acceptance_rate null values, most of the actual listings either <br>
1- have been inactive for a couple of years, or <br>
2- have very few reviews (i.e. they just started listing). <br>
A possible solution is to substitute zero for those null values, given the definition of host_acceptance_rate from the Airbnb website: <br><br>
*Your acceptance rate measures how often you accept or decline reservations. Guest inquiries are not included in the calculation of your acceptance rate. You can see your acceptance rate from the last **365** days by clicking on the Performance tab, then clicking Basic Requirements.*  <br><br>

In [20]:
Boston_listings = Boston_listings.fillna({'host_acceptance_rate':0.0})

> For `host_is_superhost`, we have three null values, all of which belong to hotels or motels listed on Airbnb. I am not sure what Airbnb's policy is, but this is plausibly the reason for the nulls. Since there are only three, filling them with f (not superhosts) should not be a problem.

In [21]:
Boston_listings = Boston_listings.fillna({'host_is_superhost':'f'})

> Parse `host_since` and `calendar_last_scraped` as dates and split each into day, month, and year columns.

In [22]:
Boston_listings['host_since'] = pd.to_datetime(Boston_listings['host_since'], format='%Y-%m-%d')

Boston_listings['host_since_day'] = Boston_listings['host_since'].dt.day
Boston_listings['host_since_month'] = Boston_listings['host_since'].dt.month
Boston_listings['host_since_year'] = Boston_listings['host_since'].dt.year

In [23]:
Boston_listings['calendar_last_scraped'] = pd.to_datetime(Boston_listings['calendar_last_scraped'], format='%Y-%m-%d')

Boston_listings['calendar_last_scraped_day'] = Boston_listings['calendar_last_scraped'].dt.day
Boston_listings['calendar_last_scraped_month'] = Boston_listings['calendar_last_scraped'].dt.month
Boston_listings['calendar_last_scraped_year'] = Boston_listings['calendar_last_scraped'].dt.year
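
Since both date columns get the same treatment, the two cells above could share one helper. This is a minimal sketch: `add_date_parts` is a hypothetical name, and it assumes the dates are stored as `YYYY-MM-DD` strings, which is how `host_since` appears in the table output earlier in the notebook.

```python
import pandas as pd


def add_date_parts(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of `df` with <column>_day, _month and _year added."""
    out = df.copy()
    # Assumes ISO-style dates such as '2008-10-17'.
    parsed = pd.to_datetime(out[column], format='%Y-%m-%d')
    out[f'{column}_day'] = parsed.dt.day
    out[f'{column}_month'] = parsed.dt.month
    out[f'{column}_year'] = parsed.dt.year
    return out


# Two host_since values taken from the output shown earlier.
toy = pd.DataFrame({'host_since': ['2008-10-17', '2011-06-25']})
toy = add_date_parts(toy, 'host_since')
```

The helper leaves the original column in place, so it can still be dropped explicitly later, as the notebook does.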

> The `beds` column has 71 null values. Those listings show no clear pattern: some actually have beds and some do not. Where a bed exists, it is in a non-traditional bedroom, i.e. in the same room as the kitchen or the living room, without doors or walls separating them. I think a suitable option here is to replace those null values with 1, as most of these listings do have an actual bed.

In [12]:
print(Boston_listings[Boston_listings['bedrooms'].isnull()]['beds'].isnull().mean())
len(Boston_listings[Boston_listings['bedrooms'].isnull()]['beds']) * Boston_listings[Boston_listings['bedrooms'].isnull()]['beds'].isnull().mean()



0.023255813953488372


13.0

In [26]:
Boston_listings = Boston_listings.fillna({'beds':1})

> There are two missing values in `minimum_minimum_nights` and in three other columns. All four are missing in the same two observations, so we can drop those two rows.

In [74]:
Boston_listings.dropna(subset=['minimum_minimum_nights'], inplace=True)
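
The claim that the nulls fall in the same rows can be verified before dropping. The sketch below uses a toy frame with made-up values; `other_nights_col` is a hypothetical stand-in for the other affected columns, whose names are not listed here.

```python
import numpy as np
import pandas as pd

# Toy frame; `other_nights_col` is a hypothetical stand-in column.
toy = pd.DataFrame({
    'minimum_minimum_nights': [1.0, np.nan, 3.0, np.nan],
    'other_nights_col': [2.0, np.nan, 5.0, np.nan],
    'price': [100.0, 90.0, 80.0, 70.0],
})
cols = ['minimum_minimum_nights', 'other_nights_col']

null_mask = toy[cols].isnull()
rows_any = null_mask.any(axis=1)   # at least one of the columns is missing
rows_all = null_mask.all(axis=1)   # every one of the columns is missing

# If the nulls always co-occur, both masks select exactly the same rows.
nulls_co_occur = rows_any.equals(rows_all)

# Dropping on one column then removes the nulls in the others as well.
cleaned = toy.dropna(subset=['minimum_minimum_nights'])
```

If `nulls_co_occur` were false, dropping on a single column would leave nulls behind in the others, and each column would need its own check.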

In [22]:
Boston_listings['host_verifications'].unique()

array(["['email', 'phone']", "['email', 'phone', 'work_email']",
       "['phone']", "['phone', 'work_email']", "['email']"], dtype=object)

> Split `host_verifications` into one binary column per verification method (`email`, `phone`, `work_email`).

In [25]:
Boston_listings['email'] = 0
Boston_listings['phone'] = 0
Boston_listings['work_email'] = 0

for ind, row in Boston_listings.iterrows():
    
    stripped = [s.strip(' []\'') for s in row['host_verifications'].strip('[]\'').split(',')]
    
    if 'email' in stripped:
        Boston_listings.at[ind, 'email'] = 1
        
    if 'phone' in stripped:
        Boston_listings.at[ind, 'phone'] = 1
        
    if 'work_email' in stripped:
        Boston_listings.at[ind, 'work_email'] = 1
        
        
        
        
        
        
# stripped = [s.strip(' []\'') for s in Boston_listings['host_verifications'][0].strip('[]\'').split(',')]
# Boston_listings['host_verifications'][0].strip('[]\'').split(',')[0],  Boston_listings['host_verifications'][0]
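
Because each `host_verifications` value is the string form of a Python list, it can also be parsed with `ast.literal_eval` instead of manual stripping and splitting. This is an alternative sketch, not the approach used in the cell above; `verification_flags` and `KNOWN_METHODS` are hypothetical names.

```python
import ast

# The three methods that appear in the unique values listed above.
KNOWN_METHODS = ('email', 'phone', 'work_email')


def verification_flags(raw: str) -> dict:
    """Parse a string like "['email', 'phone']" into 0/1 flags per method."""
    methods = set(ast.literal_eval(raw))
    return {name: int(name in methods) for name in KNOWN_METHODS}


flags = verification_flags("['email', 'phone']")
```

Membership is tested against whole list items, so `'email'` is not confused with `'work_email'`. Applied to a column, the function could be passed to `Series.apply` and the resulting dicts expanded into three columns.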


> Drop the columns that are no longer needed.

In [30]:
Boston_listings.drop(columns=['host_verifications', 'calendar_last_scraped', 'host_since','listing_url'], inplace=True)


# ============
<br><br><br><br><br><br>
# ============


In [None]:
# host_response_rate