# Airbnb Data Mining Notebook

In this notebook we work with data from Airbnb, a well-known residential rental application. Using the data for the Athens area for 3 months of 2019 (February, March and April), we are going to answer the following questions:
- What is the most common type of room_type for our data?
- Plot graphs showing the fluctuation of prices for the 3 month period.
- What are the top 5 neighborhoods with the most reviews?
- What is the neighborhood with most real estate listings?
- How many entries are per neighborhood and per month?
- Plot the histogram of the neighborhood_group variable.
- What is the most common room type (room_type) in each neighborhood (neighborhood_group)?
- What is the most expensive room type?

## Import Libraries

In [2]:
# Ignoring unnecessary warnings
import warnings
warnings.filterwarnings("ignore")  
# Specialized container datatypes
import collections
# For map visualization
import folium
# For data visualization
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# For large and multi-dimensional arrays
import numpy as np
# For data manipulation and analysis
import pandas as pd
# Natural language processing library
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.collocations import (
    BigramAssocMeasures,
    BigramCollocationFinder)
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
# For random selection 
import random
# For basic cleaning and data preprocessing 
import re
import string 
# Communicating with operating and file system
import os
# Machine learning library
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
# For wordcloud generating 
from wordcloud import WordCloud

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/pantelis/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/pantelis/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Load Dataset

We read the data with pandas' read_csv method and inspect the dataset info to check that everything loaded correctly.

In [3]:
DATASET = "./data/train.csv"
df = pd.read_csv(DATASET)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28122 entries, 0 to 28121
Data columns (total 31 columns):
id                        28122 non-null int64
month                     28122 non-null object
name                      28092 non-null object
description               27833 non-null object
transit                   19212 non-null object
host_since                28120 non-null object
host_response_rate        23067 non-null object
host_has_profile_pic      28120 non-null object
host_identity_verified    28120 non-null object
neighbourhood             27861 non-null object
city                      28113 non-null object
zipcode                   27204 non-null object
latitude                  28122 non-null float64
longitude                 28122 non-null float64
property_type             28122 non-null object
room_type                 28122 non-null object
accommodates              28122 non-null int64
bathrooms                 28122 non-null float64
bedrooms                  

In [3]:
df.head()

Unnamed: 0,id,month,name,description,transit,host_since,host_response_rate,host_has_profile_pic,host_identity_verified,neighbourhood,...,amenities,price,minimum_nights,availability_365,number_of_reviews,first_review,last_review,review_scores_rating,instant_bookable,cancellation_policy
0,10595,February,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,t,Ambelokipi,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$71.00,1,294,17,2011-05-20,2019-01-12,96.0,t,strict_14_with_grace_period
1,10988,February,"75m2, 2-br, metro, wi-fi, cable TV",Athens Furnished Apartment No4 is 2-bedroom ap...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,t,Ambelokipi,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$82.00,1,0,31,2012-10-21,2017-11-23,92.0,t,strict_14_with_grace_period
2,10990,February,"50m2, Metro, WI-FI, cableTV, more",Athens Furnished Apartment No3 is 1-bedroom ap...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,t,Ambelokipi,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$47.00,1,282,27,2012-09-06,2019-02-01,97.0,t,strict_14_with_grace_period
3,10993,February,"Studio, metro, cable tv, wi-fi, etc",The Studio is an -excellent located -close t...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,t,Ambelokipi,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$37.00,1,286,42,2012-09-24,2019-02-02,97.0,t,strict_14_with_grace_period
4,10995,February,"47m2, close to metro,cable TV,wi-fi",AQA No2 is 1-bedroom apartment (47m2) -excell...,Note: 5-day ticket for all the public transpor...,2009-09-08,100%,t,t,Ambelokipi,...,"{TV,""Cable TV"",Internet,Wifi,""Air conditioning...",$47.00,2,308,16,2010-07-08,2019-01-11,95.0,t,strict_14_with_grace_period


For the time being, null values will not be discarded. If we deleted every row containing a null value, we would exclude a sizeable portion of the data set, possibly without any meaningful reason. So we are going to tackle each question step by step and discard null values only where needed.
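To make the cost of a blanket `dropna()` concrete, here is a minimal sketch on a made-up frame. The `toy` data and its values are assumptions for illustration only, not rows from the Airbnb file.

```python
import pandas as pd

# Toy frame standing in for the Airbnb data (values are made up for illustration)
toy = pd.DataFrame({
    "name": ["Studio", None, "Loft"],
    "transit": [None, None, "Metro nearby"],
    "price": ["$37.00", "$47.00", "$82.00"],
})

# Number of missing values in each column
null_counts = toy.isnull().sum()

# Rows that would survive if every row with any null value were dropped
rows_kept = len(toy.dropna())

print(null_counts)
print("rows kept after dropna():", rows_kept, "of", len(toy))
```

Running the same two calls on `df` would show which columns carry most of the missing values before deciding what to drop for a given question.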

## Question 1.1

What is the most common type of room_type for our data?

In [None]:
df['room_type'].value_counts().plot(kind = 'pie', colors=['green', 'gold', 'black'], figsize = (8, 8))
plt.title('Pie Chart for Room Type Distribution', fontsize = 20)
plt.xlabel('Room Type')
plt.ylabel('Number of entries')
plt.show()

It is crystal clear that the most common type of rooms is 'Entire home/apartment'

In [None]:
print('Number of entries for "Entire home/apartment": {}'.format(max(df['room_type'].value_counts())))

## Question 1.2

Plot graphs showing the fluctuation of prices for the 3 month period.

In [None]:
# To plot numerical data we first clean the 'price' column by removing the '$' symbol from each row
def remove_dollar(row):
    if row[0] == '$':
        return row[1:]
    return row

df['price'] = df['price'].apply(lambda row: float(remove_dollar(row).replace(',','')))
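The same cleaning can also be expressed with pandas string methods instead of a row-wise `apply`. This is a sketch on a made-up Series (`prices_raw` is hypothetical), assuming the price strings only contain a leading `$` and `,` as thousands separator.

```python
import pandas as pd

# Made-up price strings in the same format as the 'price' column
prices_raw = pd.Series(["$71.00", "$1,250.00", "82.00"])

# Remove dollar signs and thousands separators, then convert to a numeric dtype
prices = pd.to_numeric(prices_raw.str.replace(r"[$,]", "", regex=True))

print(prices.tolist())
```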

In [None]:
# Calculate mean price for each month
mean_prices = []
months = ['February', 'March', 'April']
for month in months:
    mean_prices.append(np.mean(df.loc[df['month'] == month]['price']))
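The loop above can also be written as a single `groupby`. The sketch below uses a made-up frame (`toy` is an assumption) and `reindex` to keep the months in calendar order, since `groupby` sorts its keys alphabetically.

```python
import pandas as pd

# Made-up prices for illustration
toy = pd.DataFrame({
    "month": ["February", "March", "April", "February", "April"],
    "price": [40.0, 50.0, 70.0, 60.0, 90.0],
})
months = ["February", "March", "April"]

# Mean price per month, reordered from alphabetical to calendar order
mean_by_month = toy.groupby("month")["price"].mean().reindex(months)

print(mean_by_month)
```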

In [None]:
# Plot price fluctuation over the 3 months
plot = plt.plot(months, mean_prices)
plt.xlabel('Month')
plt.ylabel('Price $')
plt.title('Mean Price Fluctuation over February, March and April')
plt.show()

In [None]:
for i, month in enumerate(months):
    print("Mean price in month {}: ${:.2f}".format(month, mean_prices[i]))

## Question 1.3

What are the top 5 neighborhoods with the most reviews?

In [None]:
neighs = df.groupby('neighbourhood')
reviews = neighs['number_of_reviews'].sum().sort_values().tail(5)

reviews.plot(kind = 'bar', color=['#e59e6d', '#ba9653', '#963821', 'black', '#007a33'], figsize = (8, 6))
plt.xlabel('Neighbourhood')
plt.ylabel('Reviews')
plt.title('Distribution of reviews in the top neighbourhoods')

In [None]:
# Get the previously found neighbourhoods as a list
list_of_neighs = reviews.keys().tolist()
n_reviews = reviews.tolist()
# And print its members
print("Top 5 neighbourhoods are: \n")
for i,n in enumerate(list_of_neighs):
    print('{} with {} reviews'.format(n, n_reviews[i]))

## Question 1.4

What is the neighborhood with most real estate listings?


In [None]:
res = df['neighbourhood'].value_counts()
# We want the most common neighbourhood, thus the head of the list
neig = res.keys().tolist()[0]
# And also the properties it has
n_props = res.tolist()[0]
print("The neighbourhood with the most listings is {} with {} properties".format(neig, n_props))

## Question 1.5


How many entries are per neighborhood and per month?


### Entries per neighbourhood and month

In [None]:
neighbourhoods = df['neighbourhood'].value_counts().keys().tolist()
months = df['month'].value_counts().keys().tolist()

for neighbourhood in neighbourhoods:
    for month in reversed(months):
        # Count the rows matching both the neighbourhood and the month (0 if there are none)
        n_entries = ((df['neighbourhood'] == neighbourhood) & (df['month'] == month)).sum()
        print("{} in {}: {}".format(neighbourhood, month, n_entries))
    print("----------------------------")
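The same counts can be collected in one table with `pd.crosstab`. This is a sketch on made-up rows (the `toy` frame is an assumption), with the columns reindexed so that months missing from the data still appear with a count of 0.

```python
import pandas as pd

# Made-up listings for illustration
toy = pd.DataFrame({
    "neighbourhood": ["Ambelokipi", "Ambelokipi", "Plaka", "Plaka", "Plaka"],
    "month": ["February", "March", "February", "February", "April"],
})

# Rows are neighbourhoods, columns are months, cells are listing counts
counts = pd.crosstab(toy["neighbourhood"], toy["month"])
counts = counts.reindex(columns=["February", "March", "April"], fill_value=0)

print(counts)
```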

Below we plot the listings for each month for four randomly selected neighbourhoods.

In [None]:
# random.sample draws without replacement, so no neighbourhood is plotted twice
rand_neighbourhoods = random.sample(neighbourhoods, k=4)

fig = plt.figure()

for idx, rand_neighbourhood in enumerate(rand_neighbourhoods):
    ax = fig.add_subplot(2, 2, idx+1)
    # Draw on the subplot created above
    df.loc[df['neighbourhood'] == rand_neighbourhood]['month'].value_counts().sort_values().plot(kind = 'bar', color = ['orange', 'dodgerblue', 'gray'], figsize = (8, 6), ax = ax)
    ax.set_title(rand_neighbourhood)

plt.tight_layout()
plt.show()

## Question 1.6

Plot the histogram of the neighborhood_group variable.


In [None]:
df['neighbourhood'].value_counts().plot(kind = 'bar', color = ['purple','gold'], figsize = (8, 6))
plt.title('Histogram of variable neighbourhood', fontsize = 20)
plt.xlabel('Neighbourhood')
plt.ylabel('Number of entries')
plt.show()

## Question 1.7

What is the most common type of room (room_type) in every neighbourhood?


In [None]:
print("Most common type of room in every neighbourhood: \n")
for neighbourhood in neighbourhoods:
    room_counts = df.loc[df['neighbourhood'] == neighbourhood, 'room_type'].value_counts()
    # idxmax() returns the label of the largest count, whereas np.argmax may return its position
    print("{}: {} - entries: {}".format(neighbourhood, room_counts.idxmax(), room_counts.max()))
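A grouped version of the same question, sketched on made-up rows. The `toy` frame and its neighbourhood labels "A" and "B" are assumptions for illustration.

```python
import pandas as pd

# Made-up listings for illustration
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "A", "B", "B"],
    "room_type": [
        "Private room", "Entire home/apt", "Entire home/apt",
        "Shared room", "Shared room",
    ],
})

# For each neighbourhood, take the room type with the highest count
most_common = toy.groupby("neighbourhood")["room_type"].agg(
    lambda s: s.value_counts().idxmax()
)

print(most_common)
```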


## Question 1.8

What is the most expensive room type?

In [None]:
# Group the data by the room type
room_types = df.groupby('room_type')
# Find out the mean value of the prices in each room type
prices = room_types['price'].mean().sort_values(ascending = False)
prices.plot(kind = 'bar', color=['#00471b', '#eee1c6', '#0077c0'] ,figsize = (8, 6))
plt.title('Cost per room type', fontsize = 20)
plt.xlabel('Room Type')
plt.ylabel('Cost in $')
plt.show()

In [None]:
types = prices.keys().tolist()
values = prices.tolist()

print('The most expensive room type is "{}" with {:.1f} mean price'.format(types[0], values[0]))

## Question 1.9

Display on a map the listings for February, along with transit information in popup form.

In [None]:
# Store latitude/longitude/transit for February in a new dataframe, dropping rows with missing values
data = df.loc[df['month'] == 'February', ['latitude', 'longitude', 'transit']].dropna()
tooltip = 'Click me!'

# Create the map once, centred on the mean coordinates of the February listings
mapit = folium.Map(location=[data['latitude'].mean(), data['longitude'].mean()], zoom_start=12)

# Add a marker for each of the first 100 listings
for row in data[:100].itertuples():
    folium.Marker(location=[row.latitude, row.longitude], popup=row.transit, tooltip=tooltip, icon=folium.Icon(icon='info-sign')).add_to(mapit)

In [None]:
# Display map generated from Folium
mapit

## Question 1.10

Create wordclouds for columns neighbourhood/transit/description/last_review.

### Wordcloud for neighbourhood

In [None]:
# Drop missing values before joining; NaN entries would make ' '.join fail
neighbourhood_text = ' '.join(df['neighbourhood'].dropna().tolist())
wordcloud = WordCloud(max_words=1000,width=840, height=540, background_color="white").generate(neighbourhood_text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

### Wordcloud for transit

In [None]:
# Drop missing values before joining; NaN entries would make ' '.join fail
transit_text = ' '.join(df['transit'].dropna().tolist())
wordcloud = WordCloud(max_words=1000,width=840, height=540, background_color="black").generate(transit_text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

### Wordcloud for description

In [None]:
# Drop missing values before joining; NaN entries would make ' '.join fail
description_text = ' '.join(df['description'].dropna().tolist())
wordcloud = WordCloud(max_words=1000,width=840, height=540, background_color="white").generate(description_text)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()

### Wordcloud for last review

The last_review column holds dates rather than free text, so a wordcloud cannot be created for it.

## Extra Questions

### Question 1

Which are the neighbourhoods with the best review score?

In [None]:
# Note: the neighs is the df grouped by the neighbourhood field, as stated above
scores = neighs['review_scores_rating'].mean().sort_values(ascending = False).head(5)
scores.plot(kind = 'barh', color=['powderblue', 'olive', 'indigo', 'magenta', 'gold'] ,figsize = (8, 6))
plt.title('Review mean score per neighbourhood', fontsize = 20)
# We know that the reviews are high, so no need to use all the range (0-100)
plt.xlim((95,100))
plt.ylabel('Neighbourhood')
plt.xlabel('Review score /100')
plt.show()

In [None]:
list_of_neighs = scores.keys().tolist()
n_scores = scores.tolist()
# And print its members
print("Top 5 neighbourhoods are: \n")
for i,n in enumerate(list_of_neighs):
    print('{} with {:.2f} review score'.format(n, n_scores[i]))

### Question 2

How many people does each room type accommodate on average?

In [None]:
room_types = df['room_type'].value_counts().keys()
print("Average number of people each room type accommodates: \n")
for room_type in room_types:
    print("{}: {}".format(room_type, round(df.loc[df['room_type'] == room_type]['accommodates'].mean())))

## Recommendation system

First, create a new dataframe containing only the id, name and description columns.

In [4]:
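# Take an explicit copy so later in-place changes do not act on a view of the original frame
df = df[['id', 'name', 'description']].copy()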
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28122 entries, 0 to 28121
Data columns (total 3 columns):
id             28122 non-null int64
name           28092 non-null object
description    27833 non-null object
dtypes: int64(1), object(2)
memory usage: 659.2+ KB


In [5]:
df.head()

Unnamed: 0,id,name,description
0,10595,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...
1,10988,"75m2, 2-br, metro, wi-fi, cable TV",Athens Furnished Apartment No4 is 2-bedroom ap...
2,10990,"50m2, Metro, WI-FI, cableTV, more",Athens Furnished Apartment No3 is 1-bedroom ap...
3,10993,"Studio, metro, cable tv, wi-fi, etc",The Studio is an -excellent located -close t...
4,10995,"47m2, close to metro,cable TV,wi-fi",AQA No2 is 1-bedroom apartment (47m2) -excell...


In [6]:
# drop any NaN value
df.dropna(inplace=True)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27803 entries, 0 to 28121
Data columns (total 3 columns):
id             27803 non-null int64
name           27803 non-null object
description    27803 non-null object
dtypes: int64(1), object(2)
memory usage: 868.8+ KB


### Preprocessing 

Now we define our text preprocessing function. It removes punctuation and stop words, converts all letters to lowercase and lemmatizes each remaining term.

In [8]:
def preprocess_text(text):
    # replace punctuation with spaces
    text = re.sub(r'[^\w\s]', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # convert to lower case and trim leading/trailing whitespace
    text = text.lower().strip()
    # remove stop words and lemmatize the remaining terms
    # (the stop word set is built once, not once per term)
    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()
    return ' '.join(
        lemmatizer.lemmatize(term)
        for term in text.split()
        if term not in stop_words
    )

In [9]:
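# Concatenate name and description in one column and perform preprocessing.
# Note: the two fields are joined without a separator, so the last word of the name
# fuses with the first word of the description (e.g. "tvathens" in the output below).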
df['info'] = df['name'] + df['description']
df['processed_info'] = df['info'].apply(lambda row : preprocess_text(row))
df.head()

Unnamed: 0,id,name,description,info,processed_info
0,10595,"96m2, 3BR, 2BA, Metro, WI-FI etc...",Athens Furnished Apartment No6 is 3-bedroom ap...,"96m2, 3BR, 2BA, Metro, WI-FI etc...Athens Furn...",96m2 3br 2ba metro wi fi etc athens furnished ...
1,10988,"75m2, 2-br, metro, wi-fi, cable TV",Athens Furnished Apartment No4 is 2-bedroom ap...,"75m2, 2-br, metro, wi-fi, cable TVAthens Furni...",75m2 2 br metro wi fi cable tvathens furnished...
2,10990,"50m2, Metro, WI-FI, cableTV, more",Athens Furnished Apartment No3 is 1-bedroom ap...,"50m2, Metro, WI-FI, cableTV, moreAthens Furnis...",50m2 metro wi fi cabletv moreathens furnished ...
3,10993,"Studio, metro, cable tv, wi-fi, etc",The Studio is an -excellent located -close t...,"Studio, metro, cable tv, wi-fi, etcThe Studio ...",studio metro cable tv wi fi etcthe studio exce...
4,10995,"47m2, close to metro,cable TV,wi-fi",AQA No2 is 1-bedroom apartment (47m2) -excell...,"47m2, close to metro,cable TV,wi-fiAQA No2 is ...",47m2 close metro cable tv wi fiaqa no2 1 bedro...


### Term Frequency - Inverse Document Frequency

In [10]:
tfidf_data = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(df.processed_info)
print(tfidf_data.shape)

(27803, 294419)


As expected, the TF-IDF matrix has 27803 rows, one per listing, and 294419 features (the unigrams and bigrams found in the corpus).
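To see why `ngram_range=(1, 2)` produces so many features, here is a small sketch on two made-up documents (`docs` is an assumption). With unigrams only there is one column per distinct word; adding bigrams adds one column per distinct pair of adjacent words.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up documents for illustration
docs = ["athens metro studio", "athens metro apartment"]

# Unigrams only: apartment, athens, metro, studio
unigram_matrix = TfidfVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# Unigrams plus bigrams: adds "athens metro", "metro studio", "metro apartment"
bigram_matrix = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)

print(unigram_matrix.shape)  # (2, 4)
print(bigram_matrix.shape)   # (2, 7)
```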

### Listings similarity 

In this step we compute the similarity of each listing to all the others, using the TF-IDF matrix (tfidf_data) and the cosine similarity function.

In [49]:
print(type(tfidf_data))

<class 'scipy.sparse.csr.csr_matrix'>


In [39]:
cosine_similarities = cosine_similarity(tfidf_data[0:1], tfidf_data).flatten()
# for index in cosine_similarities.argsort()[:-11:-1]:
    #print(df.iloc[index])
    #print(index)

1.0000000000000002
1.0000000000000002
1.0000000000000002
0.5517143137296137
0.5517143137296137
0.5517143137296137
0.5362982608698765
0.5362982608698765
0.5362982608698765
0.5105698240270342
0.5105698240270342
0.5105698240270342
...
[output truncated: the remaining cosine similarities continue in descending order]
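The ranking idea behind the commented-out `argsort` loop can be sketched on a small dense matrix. The helper `top_k_similar` and the `vectors` array are hypothetical; the notebook's `tfidf_data` is sparse, so this only illustrates the logic and is not a drop-in replacement.

```python
import numpy as np

def top_k_similar(matrix, query_row, k):
    """Return (index, similarity) pairs of the k rows most similar to query_row."""
    # Normalise each row to unit length so that a dot product equals cosine similarity
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    unit = matrix / np.where(norms == 0, 1.0, norms)
    sims = unit @ unit[query_row]
    # Sort by descending similarity and skip the query row itself
    order = [i for i in np.argsort(-sims) if i != query_row][:k]
    return [(int(i), float(sims[i])) for i in order]

# Made-up vectors for illustration
vectors = np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],  # same direction as row 0
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 3.0],  # orthogonal to row 0
])

result = top_k_similar(vectors, query_row=0, k=2)
print(result)
```

Row 1 points in the same direction as row 0, so it comes first with a similarity of 1. The repeated values of 1.0 at the top of the output above may have a similar cause, for example the same listing text appearing in more than one month.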

In [10]:
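# Compute the cosine similarity of two dense 1-D numpy arrays.
# Note: this definition shadows sklearn's cosine_similarity imported at the top of the notebook.
def cosine_similarity(X, Y):
    dot = np.dot(X, Y)
    norm_x = np.linalg.norm(X)
    norm_y = np.linalg.norm(Y)
    return dot / (norm_x * norm_y)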

In [11]:
# listings_similarity = {}
# listings_ids = df['id'].values
# for idx_i, i in enumerate(tfidf_data):
#     # if vectorized listing i has at least one non zero value
#     if i.any(axis=0):
#         for idx_j in range(idx_i+1, len(tfidf_data)):
#             j = tfidf_data[idx_j]
#             # if vectorized listing j has at least one non zero value
#             if j.any(axis=0):
#                 similarity = cosine_similarity(i,j)
#                 # get listings' ids
#                 id1 = listings_ids[idx_i]
#                 id2 = listings_ids[idx_j]
#                 # We want to store in a dictionary the 100 
#                 # most similar listings
#                 if (idx_i < 100):
#                     # store similarity of listings
#                     listings_similarity[(id1, id2)] = similarity
#                 else:
#                     min_id1, min_id2 = min(listings_similarity, key=listings_similarity.get)
#                     if similarity > listings_similarity[(min_id1, min_id2)]:
#                         del listings_similarity[(min_id1, min_id2)]
#                         listings_similarity[(id1, id2)] = similarity

### Recommend most similar listings

In [None]:
model_knn = NearestNeighbors(metric='cosine', algorithm='brute')
model_knn.fit(tfidf_data)

# A single row of the TF-IDF matrix already has shape (1, n_features), so no reshape is needed
distances, indices = model_knn.kneighbors(tfidf_data[1313], n_neighbors=5)

In [None]:
for i in range(0, len(distances.flatten())):
    if i == 0:
        print("Title")
    else:
        print("{} {}".format(df.index[indices.flatten()[i]], distances.flatten()[i]))
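Putting the pieces together, a recommendation helper could look like the sketch below. The `docs`, `ids` and `recommend` names are hypothetical, and the tiny corpus is made up; the point is only to show the shape of the `kneighbors` call and how to skip the query listing itself.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Made-up listing texts and ids for illustration
docs = [
    "studio near metro with wifi",
    "studio near metro with wifi and cable tv",
    "villa with sea view and private pool",
    "apartment with sea view and garden",
]
ids = [101, 102, 103, 104]

tfidf = TfidfVectorizer().fit_transform(docs)
knn = NearestNeighbors(metric="cosine", algorithm="brute").fit(tfidf)

def recommend(position, n=2):
    """Return (id, cosine distance) pairs for the n listings closest to docs[position]."""
    # Ask for n + 1 neighbours because the query itself comes back with distance ~0
    distances, indices = knn.kneighbors(tfidf[position], n_neighbors=n + 1)
    pairs = [
        (ids[i], float(d))
        for i, d in zip(indices.flatten(), distances.flatten())
        if i != position
    ]
    return pairs[:n]

recommendations = recommend(0, n=2)
print(recommendations)
```

Returning listing ids instead of positional indices makes the result easier to join back to the original dataframe.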

### Top-10 words which commonly co-occur

In [21]:
"""
A utility function which constructs
a list of all words in the column
processed_info.
"""
def get_corpus(data):
    corpus = []
    # Series.items() yields (index, value) pairs; iteritems() was removed in pandas 2.0
    for _, text in data.items():
        corpus.extend(text.split(' '))
    return corpus

In [22]:
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(get_corpus(df['processed_info']))
top10_collocations = finder.nbest(bigram_measures.pmi, 10)

In [23]:
print("Top-10 words which commonly co-occur: \n")
for pair_words in top10_collocations:
    print("{} - {}".format(pair_words[0], pair_words[1]))
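PMI tends to reward pairs of very rare words, which may explain the unusual tokens in the result of this cell. The sketch below is a simplified pure-Python PMI estimate (the `pmi_scores` helper and the token list are hypothetical, not NLTK's implementation), showing how a minimum-count filter changes the top pair. NLTK's finder offers `apply_freq_filter(min_freq)` for the same purpose.

```python
import math
from collections import Counter

def pmi_scores(tokens, min_count=1):
    """Score adjacent word pairs by pointwise mutual information (log base 2)."""
    word_counts = Counter(tokens)
    pair_counts = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), c in pair_counts.items():
        # Skip pairs seen fewer than min_count times
        if c < min_count:
            continue
        # PMI = log2(P(w1, w2) / (P(w1) * P(w2))), with probabilities estimated as counts / n
        scores[(w1, w2)] = math.log2(c * n / (word_counts[w1] * word_counts[w2]))
    return scores

# Made-up corpus: one frequent pair and one pair that occurs only once
tokens = ["metro", "station"] * 5 + ["xq", "zk"]

all_scores = pmi_scores(tokens)
filtered_scores = pmi_scores(tokens, min_count=3)

print(max(all_scores, key=all_scores.get))
print(max(filtered_scores, key=filtered_scores.get))
```

Without the filter the pair that occurs once wins; with `min_count=3` the frequent pair ("metro", "station") comes out on top.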

Top-10 words which commonly co-occur: 

11번도 - 지나는
16pax_wifi_ikea_acropolis_metro - 5min_15athens_1pax_wifi_ikea
227gr - canister
2번 - 4번
3πλή - ηλεκτροβάνα
52th - 66th
5min_15 - athens_1pax_wifi_ikea_acropolis_metro
5min_23athens99_1pax_wifi_ikea_acropolis_metro - 5min_23
5min_50athens99_5pax_wifi_ikea_acropolis_metro - 5min_50
5min_8099_8pax_wifi_ikeabed_acropolis_metro - 5min_80
