# Airbnb Data Mining Notebook

In this notebook we work with data from Airbnb, a well-known residential rental application. Using the listings for the Athens area over 3 months of 2019 (February, March and April), we address the following questions and tasks:
* What is the most common room_type in our data?
* Plot graphs showing the fluctuation of prices over the 3-month period.
* What are the top 5 neighborhoods with the most reviews?
* What is the neighborhood with most real estate listings?
* How many entries are per neighborhood and per month?
* Plot the histogram of the neighborhood_group variable.
* What is the most common type of room (room_type)?
* What is the most common room type (room_type) in each neighborhood (neighborhood_group)?
* What is the most expensive room type?
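
As a rough preview of how questions like these map onto pandas operations, here is a minimal sketch on a hypothetical toy frame. The neighbourhood names and numbers are invented for illustration and are not taken from the Athens data.

```python
import pandas as pd

# Hypothetical toy listings; the notebook itself uses the Athens data instead.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "neighbourhood": ["Plaka", "Plaka", "Koukaki", "Koukaki"],
    "number_of_reviews": [10, 5, 7, 1],
})

# Most common room type overall.
most_common_room = toy["room_type"].value_counts().idxmax()

# Neighbourhoods ranked by total number of reviews (top 5 at most).
top_reviewed = toy.groupby("neighbourhood")["number_of_reviews"].sum().nlargest(5)

# Number of listings per neighbourhood.
listings_per_area = toy["neighbourhood"].value_counts()

print(most_common_room)
print(top_reviewed)
print(listings_per_area)
```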

## Import Libraries

In [1]:
# Ignoring unnecessary warnings
import warnings
warnings.filterwarnings("ignore")  
# Specialized container datatypes
import collections
# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For large and multi-dimensional arrays
import numpy as np
# For data manipulation and analysis
import pandas as pd
# Natural language processing library
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
# For basic cleaning and data preprocessing 
import re
import string 
# Communicating with operating and file system
import os
# Machine learning library
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
# For wordcloud generating 
from wordcloud import WordCloud

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pantelis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [62]:
"""
Check whether train.csv has already been created.
If it has, load it; if not, build the dataset from the
monthly listings.csv files in the data folder.
"""
DATASET = "./data/train.csv"
# The monthly frames are read unconditionally because later cells inspect them.
df_febr = pd.read_csv('./data/febrouary/listings.csv')
df_march = pd.read_csv('./data/march/listings.csv')
df_april = pd.read_csv('./data/april/listings.csv')
if os.path.exists(DATASET):
    # Assumption: train.csv holds the previously combined monthly listings.
    df = pd.read_csv(DATASET)
else:
    # Stack the three months; ignore_index gives one continuous RangeIndex.
    df = pd.concat([df_febr, df_march, df_april], ignore_index=True)
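
The cell above reads the three monthly files by explicit path. A folder-walking variant might look like the sketch below. It assumes every month has its own subfolder containing a `listings.csv`. The helper `load_monthly_listings` and the `source_folder` column are hypothetical names, and the demonstration runs on a throwaway temporary directory rather than the real `./data` folder.

```python
import os
import tempfile

import pandas as pd


def load_monthly_listings(data_dir):
    """Concatenate every listings.csv under data_dir, tagging rows with the parent folder name."""
    frames = []
    for root, _, files in os.walk(data_dir):
        for file in sorted(files):
            if file == "listings.csv":
                frame = pd.read_csv(os.path.join(root, file))
                # Hypothetical column: records which month folder the row came from.
                frame["source_folder"] = os.path.basename(root)
                frames.append(frame)
    return pd.concat(frames, ignore_index=True)


# Tiny demonstration on a temporary directory (not the real data folder).
with tempfile.TemporaryDirectory() as tmp:
    for month, ids in {"march": [1, 2], "april": [3]}.items():
        os.makedirs(os.path.join(tmp, month))
        pd.DataFrame({"id": ids}).to_csv(
            os.path.join(tmp, month, "listings.csv"), index=False
        )
    combined = load_monthly_listings(tmp)

print(combined.sort_values("id"))
```

Note that `os.walk` does not guarantee the order in which folders are visited, so the rows may not come out in calendar order. Sorting or an explicit month column is safer than relying on row position.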

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28122 entries, 0 to 28121
Columns: 106 entries, id to reviews_per_month
dtypes: float64(24), int64(21), object(61)
memory usage: 22.7+ MB


In [64]:
df_febr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9100 entries, 0 to 9099
Columns: 106 entries, id to reviews_per_month
dtypes: float64(24), int64(21), object(61)
memory usage: 7.4+ MB


In [65]:
df_march.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9361 entries, 0 to 9360
Columns: 106 entries, id to reviews_per_month
dtypes: float64(22), int64(23), object(61)
memory usage: 7.6+ MB
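
The two summaries above report different dtype counts: February has float64(24) and int64(21), while March has float64(22) and int64(23). One way to see which columns are responsible is to compare dtypes column by column. The sketch below does this on hypothetical toy frames; `dtype_differences` is an illustrative helper, not part of pandas.

```python
import pandas as pd


def dtype_differences(left, right):
    """Return the shared columns whose dtypes differ between two DataFrames."""
    shared = left.columns.intersection(right.columns)
    comparison = pd.DataFrame({
        "left": left[shared].dtypes.astype(str),
        "right": right[shared].dtypes.astype(str),
    })
    return comparison[comparison["left"] != comparison["right"]]


# Hypothetical toy frames: a missing value forces 'beds' to float in the first one.
feb = pd.DataFrame({"id": [1, 2], "beds": [1.0, None]})
mar = pd.DataFrame({"id": [3, 4], "beds": [2, 3]})

diff = dtype_differences(feb, mar)
print(diff)
```

On the real data this could be called as `dtype_differences(df_febr, df_march)`. A common cause of such mismatches is an integer column that contains missing values in one file, which pandas then reads as float.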


In [90]:
df_april.info()
for col in df_april.columns:
    print(col)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9661 entries, 0 to 9660
Columns: 106 entries, id to reviews_per_month
dtypes: float64(22), int64(23), object(61)
memory usage: 7.8+ MB
id
listing_url
scrape_id
last_scraped
name
summary
space
description
experiences_offered
neighborhood_overview
notes
transit
access
interaction
house_rules
thumbnail_url
medium_url
picture_url
xl_picture_url
host_id
host_url
host_name
host_since
host_location
host_about
host_response_time
host_response_rate
host_acceptance_rate
host_is_superhost
host_thumbnail_url
host_picture_url
host_neighbourhood
host_listings_count
host_total_listings_count
host_verifications
host_has_profile_pic
host_identity_verified
street
neighbourhood
neighbourhood_cleansed
neighbourhood_group_cleansed
city
state
zipcode
market
smart_location
country_code
country
latitude
longitude
is_location_exact
property_type
room_type
accommodates
bathrooms
bedrooms
beds
bed_type
amenities
square_feet
price
weekly_price
monthly_price
securi

In [67]:
df['month'] = 0

In [68]:
df.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,month
0,10595,https://www.airbnb.com/rooms/10595,20190208211339,2019-02-08,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...,Athens Furnished Apartment No6 is an excellent...,Athens Furnished Apartment No6 is 3-bedroom ap...,none,Ampelokipi district is nice multinational and ...,...,f,strict_14_with_grace_period,f,f,8,8,0,0,0.18,0
1,10988,https://www.airbnb.com/rooms/10988,20190208211339,2019-02-08,"75m2, 2-br, metro, wi-fi, cable TV",Athens Furnished Apartment No4 is 2-bedroom ap...,Athens Furnished Apartment No4 is an excellent...,Athens Furnished Apartment No4 is 2-bedroom ap...,none,Ampelokipi district is nice multinational and ...,...,f,strict_14_with_grace_period,f,f,8,8,0,0,0.4,0
2,10990,https://www.airbnb.com/rooms/10990,20190208211339,2019-02-08,"50m2, Metro, WI-FI, cableTV, more",Athens Furnished Apartment No3 is 1-bedroom ap...,Athens Furnished Apartment No3 is an excellent...,Athens Furnished Apartment No3 is 1-bedroom ap...,none,Ampelokipi district is nice multinational and ...,...,f,strict_14_with_grace_period,f,f,8,8,0,0,0.35,0
3,10993,https://www.airbnb.com/rooms/10993,20190208211339,2019-02-08,"Studio, metro, cable tv, wi-fi, etc",The Studio is an -excellent located -close t...,"AQA No1 is an excellent located, close to metr...",The Studio is an -excellent located -close t...,none,Ampelokipi district is nice multinational and ...,...,f,strict_14_with_grace_period,f,f,8,8,0,0,0.54,0
4,10995,https://www.airbnb.com/rooms/10995,20190208211339,2019-02-08,"47m2, close to metro,cable TV,wi-fi",AQA No2 is 1-bedroom apartment (47m2) -excell...,"AQA No2 is an excellent located, close to metr...",AQA No2 is 1-bedroom apartment (47m2) -excell...,none,Ampelokipi district is nice multinational and ...,...,f,strict_14_with_grace_period,f,f,8,8,0,0,0.15,0


In [69]:
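# Put 'id' and 'month' first, followed by the remaining columns in their original order.
colt = list(df_febr.columns)
colt.pop(0)  # remove 'id' (the first column); it is re-added at the front below
col = ['id', 'month'] + colt
df = df.reindex(columns=col)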

In [70]:
df.head()

Unnamed: 0,id,month,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,10595,0,https://www.airbnb.com/rooms/10595,20190208211339,2019-02-08,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...,Athens Furnished Apartment No6 is an excellent...,Athens Furnished Apartment No6 is 3-bedroom ap...,none,...,t,f,strict_14_with_grace_period,f,f,8,8,0,0,0.18
1,10988,0,https://www.airbnb.com/rooms/10988,20190208211339,2019-02-08,"75m2, 2-br, metro, wi-fi, cable TV",Athens Furnished Apartment No4 is 2-bedroom ap...,Athens Furnished Apartment No4 is an excellent...,Athens Furnished Apartment No4 is 2-bedroom ap...,none,...,t,f,strict_14_with_grace_period,f,f,8,8,0,0,0.4
2,10990,0,https://www.airbnb.com/rooms/10990,20190208211339,2019-02-08,"50m2, Metro, WI-FI, cableTV, more",Athens Furnished Apartment No3 is 1-bedroom ap...,Athens Furnished Apartment No3 is an excellent...,Athens Furnished Apartment No3 is 1-bedroom ap...,none,...,t,f,strict_14_with_grace_period,f,f,8,8,0,0,0.35
3,10993,0,https://www.airbnb.com/rooms/10993,20190208211339,2019-02-08,"Studio, metro, cable tv, wi-fi, etc",The Studio is an -excellent located -close t...,"AQA No1 is an excellent located, close to metr...",The Studio is an -excellent located -close t...,none,...,t,f,strict_14_with_grace_period,f,f,8,8,0,0,0.54
4,10995,0,https://www.airbnb.com/rooms/10995,20190208211339,2019-02-08,"47m2, close to metro,cable TV,wi-fi",AQA No2 is 1-bedroom apartment (47m2) -excell...,"AQA No2 is an excellent located, close to metr...",AQA No2 is 1-bedroom apartment (47m2) -excell...,none,...,t,f,strict_14_with_grace_period,f,f,8,8,0,0,0.15


In [82]:
# Label rows by position: the first 9100 rows come from February,
# the next 9361 (up to index 18460) from March, and the rest from April.
df["month"] = np.select(
    [df.index < 9100, df.index < 18461],
    ["February", "March"],
    default="April",
)
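
The month labels above depend on hard-coded row offsets (9100 and 18461) that must match the sizes of the monthly files. A possible alternative, sketched here under the assumption that the monthly frames are still in memory, is to tag each frame with its month before concatenating. The toy frames are hypothetical stand-ins for `df_febr`, `df_march` and `df_april`.

```python
import pandas as pd

# Hypothetical stand-ins for df_febr, df_march and df_april.
monthly_frames = {
    "February": pd.DataFrame({"id": [1, 2]}),
    "March": pd.DataFrame({"id": [1, 2, 3]}),
    "April": pd.DataFrame({"id": [4]}),
}

# assign() returns a copy with the new column, so the input frames stay untouched.
labeled = pd.concat(
    [frame.assign(month=name) for name, frame in monthly_frames.items()],
    ignore_index=True,
)

month_counts = labeled["month"].value_counts().to_dict()
print(month_counts)
```

With this approach the labels stay correct even if a monthly file is re-downloaded with a different number of rows.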

In [86]:
good_cols = ['id', 'month', 'zipcode', 'transit', 'bedrooms', 'beds', 'review_scores_rating', 'number_of_reviews', 'neighbourhood', 'neighbourhood_group', 'name', 'latitude', 'longitude', 'last_review', 'instant_bookable', 'host_since', 'host_response_rate', 'host_identity_verified', 'host_has_profile_pic', 'first_review', 'description', 'city', 'cancellation_policy', 'bed_type', 'bathrooms', 'accommodates', 'amenities', 'room_type', 'property_type', 'log_price', 'availability_365', 'minimum_nights']
# Keep only the columns listed in good_cols; names missing from df are ignored.
df = df.loc[:, df.columns.isin(good_cols)]
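
Filtering by membership silently ignores any name in `good_cols` that the frame does not contain. It also discards every column that is not listed. In this notebook, `good_cols` names `neighbourhood_group` and `log_price`, neither of which appears in the `df.columns` output further down, and it does not name `price`, so that column appears to be dropped as well. The following sketch, on a hypothetical toy frame, shows one way to surface both cases before reducing the data.

```python
import pandas as pd

# Hypothetical toy frame and wish list; 'log_price' is deliberately absent from the data.
toy = pd.DataFrame({"id": [1], "price": ["$50.00"], "room_type": ["Private room"]})
wanted = ["id", "room_type", "log_price"]

# Wanted names that the frame does not contain.
missing = [col for col in wanted if col not in toy.columns]
# Frame columns that the wish list would discard.
dropped = [col for col in toy.columns if col not in wanted]

# Keep only the wanted columns that exist, preserving the frame's column order.
reduced = toy.loc[:, toy.columns.isin(wanted)]

print("missing:", missing)
print("dropped:", dropped)
```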

In [87]:
df.head()

Unnamed: 0,id,month,name,description,transit,host_since,host_response_rate,host_has_profile_pic,neighbourhood,city,...,bed_type,amenities,minimum_nights,availability_365,number_of_reviews,first_review,last_review,review_scores_rating,instant_bookable,cancellation_policy
0,10595,February,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,Ambelokipi,Athens,...,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1,294,17,2011-05-20,2019-01-12,96.0,t,strict_14_with_grace_period
1,10988,February,"75m2, 2-br, metro, wi-fi, cable TV",Athens Furnished Apartment No4 is 2-bedroom ap...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,Ambelokipi,Athens,...,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1,0,31,2012-10-21,2017-11-23,92.0,t,strict_14_with_grace_period
2,10990,February,"50m2, Metro, WI-FI, cableTV, more",Athens Furnished Apartment No3 is 1-bedroom ap...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,Ambelokipi,Athens,...,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1,282,27,2012-09-06,2019-02-01,97.0,t,strict_14_with_grace_period
3,10993,February,"Studio, metro, cable tv, wi-fi, etc",The Studio is an -excellent located -close t...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,Ambelokipi,Athens,...,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",1,286,42,2012-09-24,2019-02-02,97.0,t,strict_14_with_grace_period
4,10995,February,"47m2, close to metro,cable TV,wi-fi",AQA No2 is 1-bedroom apartment (47m2) -excell...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,Ambelokipi,Athens,...,Real Bed,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",2,308,16,2010-07-08,2019-01-11,95.0,t,strict_14_with_grace_period


In [88]:
df.columns

Index(['id', 'month', 'name', 'description', 'transit', 'host_since',
       'host_response_rate', 'host_has_profile_pic', 'neighbourhood', 'city',
       'zipcode', 'latitude', 'longitude', 'property_type', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type',
       'amenities', 'minimum_nights', 'availability_365', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'instant_bookable', 'cancellation_policy'],
      dtype='object')