# Business Understanding

Airbnb is a worldwide platform that connects travelers and property owners. It acts as a third party that facilitates communication between the interested parties. The property owner lists a home or apartment (or even a cave) and sets a nightly rental price that travelers pay if they want to stay there. It is important to highlight that the price is not set by the Airbnb platform: it is determined by the property owner, and Airbnb only charges a commission on that price.

### Question 1: When looking for a property, which variables should you look at?

### Question 2: Given the characteristics of the property, how high can the price be set?

### Question 3: Which customer review scores seem to affect the price?

In [1]:
# Packages

# Data manipulation

import numpy as np
import pandas as pd
from datetime import datetime
import regex as re

# Visualization

import matplotlib.pyplot as plt
import seaborn as sns
import folium
from folium.plugins import HeatMap
import missingno as msno
from IPython.display import Image


# Other configurations
import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 500)
%matplotlib inline

# Data understanding

In [None]:
# Data

listing = pd.read_csv('listings.csv')
listing.shape

The dataset is not an extensive one: it has 3,585 rows. It is rich in information, though, with 95 columns. Next, a brief overview of what the dataset looks like.

In [None]:
listing.head(2)

In [None]:
df = listing[['host_since','host_response_time','host_response_rate','host_is_superhost','host_neighbourhood'
            ,'host_listings_count','host_total_listings_count','neighbourhood','neighbourhood_cleansed','neighbourhood_group_cleansed'
            ,'property_type','room_type','accommodates','bathrooms','bedrooms','beds','bed_type','amenities','square_feet' 
            , 'price','weekly_price','monthly_price','security_deposit','cleaning_fee','guests_included','minimum_nights','availability_30'
            , 'availability_60','availability_90','availability_365','number_of_reviews','first_review','last_review'
            , 'review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin'
            , 'review_scores_communication','review_scores_location','review_scores_value','reviews_per_month'
            , 'latitude','longitude']].copy()

`df` is a subset of `listing`. The idea behind creating it is to keep only the variables needed for the analysis: out of the 95 columns, 43 were selected. The selection was guided by the following question: **What drives the price of an Airbnb property?**

Below is a list with the features after the selection process:

- **price**: The target column
- **host_since**: Date when the host started to be part of Airbnb
- **host_is_superhost**: Indicates if the host is superhost
- **neighbourhood_cleansed**: Neighbourhood of the property
- **property_type**: Property type 
- **room_type**: Room type
- **accommodates**: Number of people that could stay in the property
- **bathrooms**: Number of bathrooms
- **bedrooms**: Number of bedrooms
- **beds**: Number of beds
- **amenities**: Property's features
- **square_feet**: Square feet of the property
- **minimum_nights**: Minimum number of nights per stay
- **number_of_reviews**: Number of reviews
- **review_scores_rating**: Customer scores
- **review_scores_accuracy**: Customer scores
- **review_scores_cleanliness**: Customer scores
- **review_scores_checkin**: Customer scores
- **review_scores_communication**: Customer scores
- **review_scores_location**: Customer scores
- **review_scores_value**: Customer scores
- **reviews_per_month**: Average number of reviews per month
- **latitude**: Latitude where the property is located
- **longitude**: Longitude where the property is located

First order of business: null values.

In [None]:
msno.matrix(df);

The previous graph is one way to visualize the missing data in our dataframe. The concept is simple: the rectangle represents the whole dataframe and the blank spaces are missing values. It's clear that some features can be removed without harming the analysis.

In [None]:
round(df[['neighbourhood_group_cleansed','square_feet','weekly_price','monthly_price','security_deposit','cleaning_fee']]\
 .isnull().sum(axis = 0)*100 / df.shape[0])

The query above computes the percentage of missing values in the selected columns. *neighbourhood_group_cleansed* and *square_feet* will be dropped due to the lack of information. *weekly_price*, *monthly_price*, *security_deposit* and *cleaning_fee* will be removed as well because of their correlation with the target variable.
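The same percentages can be computed for every column at once, which makes it easier to spot other sparse features. This is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical, not taken from `listings.csv`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    'square_feet': [np.nan, np.nan, np.nan, 500.0],
    'cleaning_fee': ['$10.00', np.nan, '$25.00', '$5.00'],
    'price': ['$100.00', '$80.00', '$120.00', '$60.00'],
})

# isnull().mean() gives the share of missing rows per column
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the columns that are candidates for removal at the top.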

In [None]:
df.drop(columns = ['neighbourhood','neighbourhood_group_cleansed','square_feet','weekly_price'
                   ,'monthly_price','security_deposit','cleaning_fee']
       , axis = 1, inplace = True)

The resulting dataframe is the following:

In [None]:
msno.matrix(df);

Worth noting: most of the time, if one review score is missing, all the review scores are missing. We'll inspect these features later in this project.
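One way to check this observation numerically is to count, for each row, how many review score columns are missing. The sketch below uses a made-up frame (`toy` is hypothetical); on the real data the same lines could be applied to the `review_scores_*` columns of `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with three of the review score columns
toy = pd.DataFrame({
    'review_scores_rating':   [95.0, np.nan, 80.0, np.nan],
    'review_scores_accuracy': [10.0, np.nan, 9.0, np.nan],
    'review_scores_value':    [9.0, np.nan, np.nan, np.nan],
})

review_cols = [c for c in toy.columns if c.startswith('review_scores_')]

# Number of missing review scores in each row, then how often each count occurs
missing_per_row = toy[review_cols].isnull().sum(axis=1)
pattern_counts = missing_per_row.value_counts().sort_index()
print(pattern_counts)
```

If the scores really go missing together, almost all rows fall in the first bucket (nothing missing) or the last one (everything missing).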

Before doing any analysis, we need to fix the format of the target variable *price*. It was given to us as text with a dollar sign and, in some cases, commas.

In [None]:
df.price

In [None]:
# Strip the dollar sign and thousands separators, then convert to a number
df['price'] = df['price'].str.replace(r'[$,]', '', regex=True).astype(float)

## Marginal distributions

With the price variable in the correct format, let's dive into the dataset!

### Target variable - Price

In [None]:
fig, ax = plt.subplots(figsize=(14,8));

sns.histplot(x=df.price, kde=True, stat='density')
plt.axvline(x=np.percentile(df.price,50), color = '#820707', label = '50th percentile')
plt.axvline(x=np.percentile(df.price,95), color = 'r', label = '95th percentile')
ax.set_xticks((0, np.percentile(df.price,50),np.percentile(df.price,95),1000,2000,3000,4000 ))

plt.title('Distribution of prices')
plt.xlabel('Prices')
plt.ylabel('Marginal distribution')
plt.legend();

Undoubtedly, the distribution of the target variable is skewed to the right. Additionally, two vertical lines were added to the graph, representing the median and the 95th percentile of the distribution. This last part was intended to emphasize how skewed the distribution is.

In [None]:
fig, ax = plt.subplots(figsize=(14,8));

sns.histplot(x=df.query('price < 500')['price'], kde=True, stat='density')
plt.axvline(x=np.percentile(df.price,50), color = '#820707', label = '50th percentile of all prices')
plt.axvline(x=np.percentile(df.price,95), color = 'r', label = '95th percentile of all prices')
# plt.text(x = np.percentile(df.price,95), y = 0, s = str(round(np.percentile(df.price,95))), ha='center' )
ax.set_xticks((0, 100, np.percentile(df.price,50),200, 300, np.percentile(df.price,95),400, 500 ))

plt.title('Distribution of prices lower than $500')
plt.xlabel('Prices')
plt.ylabel('Marginal distribution')
plt.legend();

print('Prices over $500 are in the top {}% of the distribution'.format(100 - (round(df.query('price < 500')['price'].count() *100 / df.shape[0],1))))

This is a close-up of the prices lower than \$500. Like the overall distribution, this subset is right-skewed, although less markedly.
Prices over \$500 are in the top 2.5% of the distribution.

In [None]:
fig, ax = plt.subplots(figsize=(14,8));

sns.histplot(x=np.log(df.price), kde=True, stat='density')
plt.axvline(x=np.percentile(np.log(df.price),50), color = '#820707', label = '50th percentile of Log(price)')
plt.axvline(x=np.percentile(np.log(df.price),95), color = 'r', label = '95th percentile of Log(price)')

plt.title('Log-Distribution of prices')
plt.xlabel('Log-Prices')
plt.ylabel('Marginal distribution')
plt.legend();

The right skew is addressed by applying a log transformation to the variable.

In [None]:
df.price = np.log(df.price)
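The effect of the log transformation can also be checked with a number instead of a plot. The sketch below compares the sample skewness before and after the transformation on synthetic log-normal data. The data is an assumption made for illustration only and is not the Airbnb prices.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal draw), NOT the Airbnb data
rng = np.random.default_rng(0)
toy_price = pd.Series(rng.lognormal(mean=5.0, sigma=0.6, size=5000))

raw_skew = toy_price.skew()          # clearly positive for a right-skewed variable
log_skew = np.log(toy_price).skew()  # close to zero once the tail is compressed

print(f'skewness before log: {raw_skew:.2f}')
print(f'skewness after log:  {log_skew:.2f}')
```

On the real data, `df.price.skew()` could be evaluated in the same way before and after the transformation.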

###  Independent variables

In [None]:
## Need to plot the variable
df.property_type = df.property_type.astype(str)

plt.figure(figsize = (16,10))
var_plot = ['accommodates','bedrooms','bathrooms','minimum_nights','room_type', 'property_type']
for i, var in enumerate(var_plot):
    plt.subplot(3,2,i+1)
    sns.ecdfplot(df[var])
    plt.title('Cumulative distribution of ' + var)  
    plt.xlabel('')
    plt.subplots_adjust(hspace = 0.5)
    if var == 'property_type':
        plt.setp(plt.subplot(3,2,i+1).xaxis.get_majorticklabels(), rotation=45)
        

Key annotations

- 80% of the properties in this dataset accommodate up to 4 people, so larger capacities can be grouped together
- Less than 20% of the listings have more than 2 bedrooms
- The vast majority of the places have only one bathroom
- The minimum nights variable needs to be checked before adding it to a model. It's not clear why the minimum stay for an Airbnb would be 250 nights
- There are mainly two room types on offer: the entire apartment/house or a private room
- Up to 85% of the properties are either a house or an apartment

The amenities column has the following format:

In [None]:
df.amenities

This information needs to be cleaned before we can get any valuable insight from it.

In [None]:
# Remove the braces and double quotes that wrap the amenity lists
df['amenities'] = df['amenities'].str.replace(r'[{}"]', '', regex=True)

In [None]:
# One row per (listing, amenity); drop placeholder and empty entries
amenities = df.amenities.str.split(',').explode().reset_index(drop=True)
amenities = amenities[~amenities.isin(['translation missing: en.hosting_amenity_49'
                                       ,'translation missing: en.hosting_amenity_50'
                                       ,''])]

In [None]:
fig, ax = plt.subplots(figsize=(14,8))
amenities.value_counts()[:15].plot.barh()
plt.gca().invert_yaxis()
plt.xlabel('Count of records')
plt.title('Amenities');

This variable is complicated to study because of all the possible combinations of amenities. For example, an apartment may have wireless internet, heating, kitchen, and shampoo, while a house has wireless internet, heating, kitchen, and a carbon monoxide detector. Each property ends up with its own set of amenities, so if we want to apply any ML algorithm, it isn't appropriate to treat the concatenated combination as a single category and add it to the model.
One way of dealing with this variable is to create a dummy variable for each amenity, which adds 50+ columns indicating whether each property has the amenity or not.
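The dummy-variable idea can be sketched with `Series.str.get_dummies`, which splits each string on a separator and returns one 0/1 column per distinct value. The two rows below are the made-up apartment and house from the example above, not rows from the dataset.

```python
import pandas as pd

# Hypothetical amenity strings, already stripped of braces and quotes
toy_amenities = pd.Series([
    'Wireless Internet,Heating,Kitchen,Shampoo',
    'Wireless Internet,Heating,Kitchen,Carbon Monoxide Detector',
])

# One indicator column per amenity: 1 if the property has it, 0 otherwise
amenity_dummies = toy_amenities.str.get_dummies(sep=',')
print(amenity_dummies)
```

On the real data the result could be joined back to `df` by index. Rare amenities would produce nearly constant columns, so some filtering by frequency may be worth considering first.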

In [None]:
sort_order = df.neighbourhood_cleansed.value_counts().sort_values(ascending = True).index

fig, ax = plt.subplots(figsize = (16,8))
sns.countplot(x = df.neighbourhood_cleansed, order = sort_order, palette = 'Blues')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.title('Count of records per neighbourhood')
plt.ylabel('Count of records')
plt.xlabel('');

A question arises: does this distribution mean that some neighbourhoods are unattractive to tourists?

## Multivariate distribution and interactions

### Host_since

The *host_since* column is a date, and a raw date doesn't tell us much on its own, so a transformation is needed.
A more meaningful variable could be the number of months since the person became a host, so let's create that!
The initial format of the variable is the following:

In [None]:
df.host_since

In [None]:
# %m is the month directive (%M would parse minutes)
meses_host = pd.to_datetime(df.host_since, format='%Y-%m-%d')

In [None]:
def diff_month(d1, d2):
    """
    Calculate the difference in whole calendar months between two dates.

    Args:
        d1: Later date
        d2: Earlier date

    Returns:
        Number of months from d2 to d1 (the day of the month is ignored).
    """
    return (d1.year - d2.year) * 12 + d1.month - d2.month

# Reference date: the most recent host_since value in the column
reference_date = meses_host.max()
df.host_since = meses_host.apply(lambda x: diff_month(reference_date, x))

The final format is numeric:

In [None]:
df.host_since

To clarify: the date when the dataset was created is missing, so I took the maximum date in the column and used it as the reference date for the comparison.
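As a quick check of the month arithmetic, here is a small worked example. The two dates are hypothetical and chosen only for illustration. The last lines show why the format directive matters: in `strptime`, `%m` is the month while `%M` is the minute.

```python
from datetime import datetime

def diff_month(d1, d2):
    """Whole calendar months from d2 to d1 (day of month is ignored)."""
    return (d1.year - d2.year) * 12 + d1.month - d2.month

# Hypothetical dates, chosen only to illustrate the arithmetic
reference_date = datetime(2016, 9, 6)
host_since = datetime.strptime('2014-11-20', '%Y-%m-%d')  # %m = month

# (2016 - 2014) * 12 + (9 - 11) = 22
months_as_host = diff_month(reference_date, host_since)
print(months_as_host)

# Pitfall: %M is the minute directive, so the month silently defaults to January
misparsed = datetime.strptime('2014-11-20', '%Y-%M-%d')
print(misparsed.month, misparsed.minute)
```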

## Narrowing-down tails

### Number of reviews

In [None]:
fig, ax = plt.subplots(figsize = (16,10))
sns.boxplot(y = 'price', x = 'number_of_reviews' , data = df)
xticks = list(range(0, 190, 10))
plt.setp(ax, xticks=xticks, xticklabels = xticks)
plt.title('Log-Price vs Number of reviews')
plt.axvline(x=np.percentile(df.number_of_reviews,50), color = '#FFCECE', label = '50th percentile')
plt.axvline(x=np.percentile(df.number_of_reviews,75), color = '#FF9191', label = '75th percentile')
plt.axvline(x=np.percentile(df.number_of_reviews,85), color = '#FF3030', label = '85th percentile')
plt.legend();


The graph illustrates the distribution of the log-price across the number of reviews, together with the 50th, 75th, and 85th percentiles of the number of reviews. The variable doesn't seem to be relevant: the median price looks about the same for every number of reviews. Beyond the 75th percentile the price becomes highly volatile, probably because there are few observations for each review count.
I decided to assign the value 54 to every number of reviews greater than 54, in order to avoid long tails.

In [None]:
df.loc[df.number_of_reviews >= 54,['number_of_reviews']] = 54

The new boxplot looks like this:

In [None]:
fig, ax = plt.subplots(figsize = (16,10))
sns.boxplot(y = 'price', x = 'number_of_reviews' , data = df)

plt.title('New Log-Price vs Number of reviews')
plt.axvline(x=np.percentile(df.number_of_reviews,50), color = '#FFCECE', label = '50th percentile')
plt.axvline(x=np.percentile(df.number_of_reviews,75), color = '#FF9191', label = '75th percentile')
plt.axvline(x=np.percentile(df.number_of_reviews,85), color = '#FF3030', label = '85th percentile')
plt.legend();
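The capping applied to *number_of_reviews* can also be written with `Series.clip`, which replaces every value above an upper bound with that bound. A minimal sketch on made-up review counts (the values in `reviews` are hypothetical):

```python
import pandas as pd

# Hypothetical review counts, including a long right tail
reviews = pd.Series([0, 3, 12, 54, 80, 404])
cap = 54

# Values above the cap are replaced by the cap; the rest stay unchanged
capped = reviews.clip(upper=cap)
print(capped.tolist())
```

Compared with a boolean `.loc` assignment, `clip` returns a new series and leaves the original values available for comparison.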

### Minimum nights

In [None]:
fig, ax = plt.subplots(figsize = (16,10))
sns.boxplot(y = 'price', x = 'minimum_nights' , data = df)
plt.title('Log-Price vs minimum_nights')
plt.axvline(x=np.percentile(df.minimum_nights,50), color = '#FFCECE', label = '50th percentile')
plt.axvline(x=np.percentile(df.minimum_nights,75), color = '#FF9191', label = '75th percentile')
plt.axvline(x=np.percentile(df.minimum_nights,85), color = '#FF3030', label = '85th percentile')
plt.legend();


Following the same reasoning as in the previous section, *minimum_nights* values above 6 are set to 6.

In [None]:
df.loc[df.minimum_nights >= 6,['minimum_nights']] = 6 

The new boxplot looks like this:

In [None]:
fig, ax = plt.subplots(figsize = (16,10))
sns.boxplot(y = 'price', x = 'minimum_nights' , data = df)
plt.title('New Log-Price vs minimum_nights')
plt.axvline(x=np.percentile(df.minimum_nights,50), color = '#FFCECE', label = '50th percentile')
plt.axvline(x=np.percentile(df.minimum_nights,75), color = '#FF9191', label = '75th percentile')
plt.axvline(x=np.percentile(df.minimum_nights,85), color = '#FF3030', label = '85th percentile')
plt.legend();

## Property-related variables

In [None]:
plt.figure(figsize = (16,12))
property_vars = ['bedrooms', 'bathrooms', 'bed_type','property_type']
for i, var in enumerate(property_vars):
    plt.subplot(2,2,i+1)
    sns.boxplot(y = 'price', x = var , data = df, palette = "Greens")
    plt.ylabel('Log-Price')
    plt.title('Price vs ' + var)
    if var == 'property_type':
        plt.setp(plt.subplot(2,2,i+1).xaxis.get_majorticklabels(), rotation=45)
    

Keynotes


- Bedrooms: there is a roughly linear relation between the number of bedrooms and the log-price. It may also be a case of diminishing returns.
- Bathrooms: similar to the number of bedrooms, the relation may show diminishing returns.
- Bed type: comparing the boxplot for real beds with the other bed types, the straightforward conclusion is that people tend to choose a real bed over sofas and couches.
- Property type: the graph shows that houses tend to be cheaper than apartments. The apartments may be located downtown and the houses in the suburbs.

### Zero bedrooms in an apartment?

Out of curiosity: in the first graph there are properties without bedrooms, which doesn't seem to make sense. Luckily, the initial data includes the URL of each listing. Viewing one of the links makes it clear why an apartment could have zero bedrooms and one bathroom. A loft!

In [None]:
Image("loft.png")

## Other variables

In [None]:
plt.figure(figsize = (16,12))
reviews_scores = df[['review_scores_location','review_scores_accuracy','review_scores_cleanliness','review_scores_value']]
for i, var in enumerate(reviews_scores):
    plt.subplot(2,2,i+1)
    sns.boxplot(y = 'price', x = var, data = df, palette = 'Blues')
    plt.ylabel('Log-Price')
    plt.title('Price vs ' + var);
    

Keynotes

- *review_scores_location* and *review_scores_cleanliness* are correlated with the rental price of the property. This result is not surprising: location and cleanliness are part of the first impression.

### Can having a superhost increase the price?

In [None]:
df['host_is_superhost'] = (df['host_is_superhost'] == 't').astype(int)

In [None]:
plt.figure(figsize = (14,10))
sns.violinplot(x = 'host_is_superhost', y = 'price', data = df, palette = 'Blues'); 

At first glance, there is no difference in price between listings with and without a superhost. But let's repeat this analysis within each neighbourhood.

### Is there evidence of Simpson's Paradox in the superhost variable?

In [None]:
neighbourhood = df.neighbourhood_cleansed.unique()
cvec = dict()
for x in neighbourhood:
    temp = df[df['neighbourhood_cleansed'] == x]
    cvec[x] = temp['price'].corr(temp['host_is_superhost'])

corr = list(cvec.values())

plt.figure(figsize = (12,10))
sns.histplot(x=corr, kde=True, stat='density')
plt.title('Superhost grouped by Neighbourhood')
plt.xlabel('Correlation')
plt.ylabel('');

Even though the overall distribution of the *host_is_superhost* variable may not show any relation with the price, when the data is partitioned by neighbourhood and the correlation between *host_is_superhost* and *price* is calculated within each one, the correlation tends to be positive.
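To make the idea concrete, the sketch below builds a tiny made-up dataset in which the pooled correlation between *host_is_superhost* and the log-price is negative while the correlation inside each neighbourhood is positive. The neighbourhoods `A` and `B` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical data: A is expensive with few superhosts, B is cheap with many
toy = pd.DataFrame({
    'neighbourhood_cleansed': ['A'] * 4 + ['B'] * 4,
    'host_is_superhost':      [0, 0, 0, 1, 0, 1, 1, 1],
    'price':                  [5.0, 5.1, 5.2, 5.4, 4.0, 4.2, 4.3, 4.4],
})

# Correlation over all rows, ignoring the neighbourhood
pooled_corr = toy['price'].corr(toy['host_is_superhost'])

# Correlation computed separately inside each neighbourhood
within_corr = {
    name: group['price'].corr(group['host_is_superhost'])
    for name, group in toy.groupby('neighbourhood_cleansed')
}

print(pooled_corr)
print(within_corr)
```

The sign flips because the neighbourhood is related both to the price level and to the share of superhosts, which is the mechanism behind Simpson's paradox.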

In [None]:
# price is already log-transformed, so the $500 cap is applied on the log scale
log_cap = np.log(500)
sort_order = df.query('price <= @log_cap')\
                .groupby('neighbourhood_cleansed')['price']\
                .median()\
                .sort_values(ascending=True)\
                .index

fig, ax = plt.subplots(figsize = (16,8))
sns.boxplot(y='price', x='neighbourhood_cleansed', data=df.query('price <= @log_cap'), 
            order=sort_order, palette = 'Greens')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
sns.despine();

A question arises: are the *price*, the *neighbourhood* and the *customer scores for the location* related? Let's create a map and investigate!

## Maps 

#### The following graph shows the relationship between the coordinates and the price

In [None]:
max_value = df.price.max()
folium_hmap = folium.Map(width=800,height=500,
                         location = [42.366516, -71.057424],
                        zoom_start = 13)#,
                        ##tiles = "OpenStreetMap")
hm_wide = HeatMap(list(zip(df['latitude'], df['longitude'], df['price'])),
                 min_opacity = 0.1,
                 radius = 0.2,
                 blur = 6,
                 max_zoom = 15,
                 max_val = 100)
folium.TileLayer('Stamen Terrain').add_to(folium_hmap)
folium_hmap.add_child(hm_wide)

#### The following graph shows the relationship between the coordinates and the customer score of the property location

In [None]:
max_value = df.price.max()
folium_hmap = folium.Map(width=800,height=500,
                         location = [42.366516, -71.057424],
                        zoom_start = 13)#,
                        ##tiles = "OpenStreetMap")
hm_wide = HeatMap(list(zip(df[df['review_scores_location'].notnull()]['latitude']
                           , df[df['review_scores_location'].notnull()]['longitude']
                           , df[df['review_scores_location'].notnull()]['review_scores_location'])),
                 min_opacity = 0.1,
                 radius = 0.2,
                 blur = 6,
                 max_zoom = 15,
                 max_val = 100)
folium.TileLayer('Stamen Terrain').add_to(folium_hmap)
folium_hmap.add_child(hm_wide)

The maps show a clear relationship between guests' perception of the location and the price they pay: moving away from downtown lowers the price of renting an Airbnb property. Let's quantify this relationship.

## Correlations

In [None]:
var_corr = ['price','bathrooms','bedrooms','beds','review_scores_cleanliness',
            'review_scores_location','longitude','latitude','accommodates']

corr = df[var_corr].corr()

mask = np.triu(np.ones_like(corr, dtype=bool));
f, ax = plt.subplots(figsize=(12, 10));

#cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap='Blues', vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot = True, fmt = '.2f');

This is a classical correlation matrix: the stronger the linear relation, the higher the absolute correlation value. Fortunately, there are some useful correlations that can be exploited by an ML algorithm with the price as the target.

In [None]:
plt.figure(figsize = (16,12))
sns.heatmap(df.groupby(['neighbourhood_cleansed', 'review_scores_location'])['price']\
                .mean()\
                .reset_index()\
                .pivot(index='neighbourhood_cleansed', columns='review_scores_location', values='price')\
                .sort_index(ascending=False),
            cmap="Greens", fmt='.2f', annot=True, linewidths=0.5)
plt.xlabel('Location review score', fontsize = 15)
plt.ylabel('')
plt.title('Average Log-price', fontsize = 25);

print('Reference table \nLog-price of 4 is ${} \nLog-price of 5 is ${} \nLog-price of 6 is ${}'\
      .format(round(np.exp(4)),round(np.exp(5)),round(np.exp(6))))

First, let's explain this table. The neighbourhoods are on the y-axis and the customer location score on the x-axis. For each neighbourhood-score combination, the average log-price is calculated, then plotted and colored to give a visual sense of the price.
Visually, there is no clear pattern between these variables; an ML model could quantify this.

# Conclusions

- Some meaningful correlations were found between the target variable (price) and the independent variables (the variables are shown in the Correlations section).
- As expected, properties near downtown have higher prices than those in the suburbs.
- Special caution is needed because of Simpson's paradox across the variables.
- Customer perception is an influential set of features that drives the price.

## Next steps

- Construct and calibrate an ML algorithm with the analysed variables.
- Explore the excluded variables in more depth. NLP models may be useful.