# Project 4: Exploring the Seattle Airbnb Dataset (Part 1 - Reviews)

As part of the submission for the 4th Project in the Udacity Data Science Nanodegree, this notebook contains the code from the analysis of the Seattle Airbnb dataset available on Kaggle: https://www.kaggle.com/airbnb/seattle/data.
Part 1 covers:

- Identifying positive and negative customer reviews/sentiment in the dataset
- Understanding customer sentiment and how it changed over time in different Seattle neighbourhoods


In [2]:
# Install the third-party packages used below (run once per environment)
!python -m pip install langdetect afinn gensim

import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
import gensim
from langdetect import detect
from translate import Translator as tran
from googletrans import Translator
from afinn import Afinn




In [297]:
#read in review data
reviews = pd.read_csv(r"C:\Users\Adetomiwa\Desktop\DATA SCIENCE DEGREE\Python\Project 4\seattle\reviews.csv")

In [405]:
reviews.shape

(84821, 8)

In [299]:
reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


In [300]:
reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84849 entries, 0 to 84848
Data columns (total 6 columns):
listing_id       84849 non-null int64
id               84849 non-null int64
date             84849 non-null object
reviewer_id      84849 non-null int64
reviewer_name    84849 non-null object
comments         84831 non-null object
dtypes: int64(3), object(3)
memory usage: 3.9+ MB


In [301]:
# We'll drop the rows with blank comments
reviews = reviews.dropna()

In [302]:
# Detect the language of each comment to flag anomalies
comments = pd.DataFrame(reviews['comments'])
language = []
lang_index = []
errors = []
for index, row in comments.iterrows():
    try:
        language.append(detect(str(row['comments'])))
    except Exception:
        # detect() raises when it finds no usable features in the text
        language.append('investigate')
        errors.append(index)
        print('this index gives an error:', index)
    lang_index.append(index)

this index gives an error: 4300
this index gives an error: 8863
this index gives an error: 9112
this index gives an error: 10303
this index gives an error: 13890
this index gives an error: 28801
this index gives an error: 35051
this index gives an error: 51426
this index gives an error: 66132
this index gives an error: 67811
this index gives an error: 71206
this index gives an error: 77823


In [303]:
# Create a dataframe of the language values maintaining the original index of where they came from
lang_df = pd.DataFrame({'language':language}, index = lang_index)

In [406]:
# Add language to the dataframe, check how many languages to translate/investigate
reviews['language'] = lang_df['language']
reviews['language'].value_counts()

en             83779
fr               239
de               211
zh-cn            160
es                79
ko                51
ro                42
nl                40
ja                29
pt                26
it                25
zh-tw             23
af                22
so                15
da                13
sv                10
ru                 8
no                 7
pl                 5
ca                 5
fi                 5
tl                 4
cy                 3
tr                 3
investigate        2
sl                 2
hr                 2
cs                 2
vi                 1
sw                 1
et                 1
el                 1
sk                 1
sq                 1
th                 1
fa                 1
hu                 1
Name: language, dtype: int64

In [484]:
# Let's grab the reviews we want to investigate

reviews[reviews['language'] == 'investigate']

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,en_comments


- Two of the flagged comments are happy reviews (a `:)` smiley), so we will keep them.
- We expect the other flagged comments to drop out with further preprocessing and cleansing.

In [306]:
# replace happy smileys with 'I am happy' and update their language to english
reviews.loc[reviews.comments == ":)", 'comments'] = '"I am happy"'
reviews.loc[(reviews.comments == '"I am happy"') & (reviews.language == 'investigate'), 'language'] = 'en'


In [307]:
# Drop other 'investigate' languages.
reviews = reviews[reviews.language != 'investigate']
reviews

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,language
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...,en
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...,en
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb...",en
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...,en
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...,en
...,...,...,...,...,...,...,...
84844,3624990,50436321,2015-10-12,37419458,Ryan,The description and pictures of the apartment ...,en
84845,3624990,51024875,2015-10-17,6933252,Linda,We had an excellent stay. It was clean and com...,en
84846,3624990,51511988,2015-10-20,19543701,Jaime,"Gran ubicación, cerca de todo lo atractivo del...",es
84847,3624990,52814482,2015-11-02,24445024,Jørgen,"Very good apartement, clean and well sized. Si...",en


In [438]:
# Translate function using the translate package

def translate_comments(df): 
    '''
    Translate the comments column of a dataframe
    df - dataframe containing comments
    result will append a column 'en_comments' which stores the english translation in the same dataframe
    '''
    translator = tran(to_lang = 'en')
    for idx, row in df.iterrows():
        if df.loc[idx, 'language'] != 'en':
            df.loc[idx, 'en_comments'] = translator.translate(row['comments'])
    return df

In [472]:
# Translate function using googletrans

def translate_comments_goog(df): 
    '''
    Translate the comments column of a dataframe
    df - dataframe containing comments
    result will append a column 'en_comments' which stores the english translation in the same dataframe
    '''
    
    for idx, row in df.iterrows():
        if df.loc[idx, 'language'] != 'en':
            source = df.loc[idx, 'language']
            translator = Translator()
            df.loc[idx, 'en_comments'] = translator.translate(row['comments'], src = source, dest = 'en').text
            time.sleep(1)
    return df

***Why are there 2 Translation functions?!***

I originally tested two different translation functions because I was having trouble getting either of them to work. I ultimately got the Google Translate API working with a couple of tweaks:

- Passing only the subset of non-English comments to be translated
- Adding a sleep time of 1 second between iterations to stay within the Google Translate API's limits for a given time window. This added some processing time.
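
The throttling idea can be sketched independently of any translation library. This is a minimal illustration rather than the code used in this notebook: `translate_throttled`, `translate_fn`, `delay` and `retries` are hypothetical names, and the stand-in translator simply upper-cases text so the example runs offline.

```python
import time

def translate_throttled(texts, translate_fn, delay=1.0, retries=2):
    """Call translate_fn on each text, pausing `delay` seconds between calls.

    translate_fn is any callable that takes a string and returns a string
    (a hypothetical stand-in for a real translation client).
    Texts that still fail after `retries` extra attempts come back as None.
    """
    results = []
    for text in texts:
        translated = None
        for attempt in range(retries + 1):
            try:
                translated = translate_fn(text)
                break
            except Exception:
                # wait a little longer after each failed attempt
                time.sleep(delay * (attempt + 1))
        results.append(translated)
        time.sleep(delay)  # pause before the next request
    return results


# Offline stand-in for a translator so the sketch can be run anywhere
fake_translate = str.upper
print(translate_throttled(["sehr gut", "gran ubicación"], fake_translate, delay=0))
```

Returning `None` for failed items, instead of raising, keeps one bad comment from stopping a long run; the failed rows can then be retried or inspected separately.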


In [427]:
# define function to detect the language of the comments

def detect_lang(comments, df):
    
    '''
    Detect the language of each comment and append the result to the dataframe.
    
    comments - the name of the column that contains customer comments
    df - the dataframe containing the comments to be evaluated
    The function appends a column 'language' to the dataframe that was passed
    and returns the number of comments per language
    
    '''
    language = []
    lang_index = []
    errors = []
    for index, text in df[comments].items():
        try:
            language.append(detect(str(text)))
        except Exception:
            language.append('investigate')
            errors.append(index)
            print('this index gives an error:', index)
        lang_index.append(index)
    # assign once, after the loop, keeping the original index alignment
    df['language'] = pd.Series(language, index=lang_index)
        
    return df['language'].value_counts()

In [470]:
# Identify the reviews to translate to English

# take a copy so a column can be added without writing into a slice of reviews
translate_me = reviews[reviews.language != 'en'].copy()
translate_me

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,en_comments
9,7202016,48388999,2015-09-26,38110731,Tatiana,"The place was really nice, clean, and the most...",es,"The place was really nice, clean, and the most..."
80,7735100,43837578,2015-08-22,40603512,Puhong,我们是一家三口，可爱的女儿，夫妻二人都是中国来的访问学者，来到美丽的西雅图，住在了Roger...,zh-cn,我们是一家三口，可爱的女儿，夫妻二人都是中国来的访问学者，来到美丽的西雅图，住在了Roger...
288,7550234,47535547,2015-09-20,40619236,Theo,Sehr gut,de,Sehr gut
301,7550234,56949077,2015-12-20,47521443,Ying,房间的描述与实际相符，离华盛顿西雅图分校很近，非常完美，干净且安静，wifi很好，早餐也比较...,zh-cn,房间的描述与实际相符，离华盛顿西雅图分校很近，非常完美，干净且安静，wifi很好，早餐也比较...
386,1205666,16974756,2014-08-05,14362449,Uwe,Sean und seine Frau Clara sind sehr freundlich...,de,Sean und seine Frau Clara sind sehr freundlich...
...,...,...,...,...,...,...,...,...
84121,6079131,46598039,2015-09-12,28649034,Martha Eugenia,"Fue agradable, los anfitriones preparan tu est...",es,"Fue agradable, los anfitriones preparan tu est..."
84217,264829,28179087,2015-03-19,28485297,Gele,nice place it stadies like home,af,nice place it stadies like home
84338,7619060,52305251,2015-10-27,20806545,Kevin,非常舒适安静的房子，床边有窗户，晚上睡觉可以看见星星月亮，房东非常热情，退房的时候我去机场，...,zh-cn,非常舒适安静的房子，床边有窗户，晚上睡觉可以看见星星月亮，房东非常热情，退房的时候我去机场，...
84378,4577542,44747593,2015-08-29,25805389,Sebastian,"Die Unterkunft war sehr schön und genau so, wi...",de,"Die Unterkunft war sehr schön und genau so, wi..."


In [474]:
# Translate non-english reviews

translate_comments_goog(translate_me)

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,en_comments
9,7202016,48388999,2015-09-26,38110731,Tatiana,"The place was really nice, clean, and the most...",es,"The place was really nice, clean, and the Most..."
80,7735100,43837578,2015-08-22,40603512,Puhong,我们是一家三口，可爱的女儿，夫妻二人都是中国来的访问学者，来到美丽的西雅图，住在了Roger...,zh-cn,"We are a family of three, lovely daughter, hus..."
288,7550234,47535547,2015-09-20,40619236,Theo,Sehr gut,de,Very good
301,7550234,56949077,2015-12-20,47521443,Ying,房间的描述与实际相符，离华盛顿西雅图分校很近，非常完美，干净且安静，wifi很好，早餐也比较...,zh-cn,Description consistent with the reality of the...
386,1205666,16974756,2014-08-05,14362449,Uwe,Sean und seine Frau Clara sind sehr freundlich...,de,"Sean and his wife Clara are very friendly, att..."
...,...,...,...,...,...,...,...,...
84121,6079131,46598039,2015-09-12,28649034,Martha Eugenia,"Fue agradable, los anfitriones preparan tu est...",es,"It was nice, the hosts prepare your stay with ..."
84217,264829,28179087,2015-03-19,28485297,Gele,nice place it stadies like home,af,nice place it stadies like home
84338,7619060,52305251,2015-10-27,20806545,Kevin,非常舒适安静的房子，床边有窗户，晚上睡觉可以看见星星月亮，房东非常热情，退房的时候我去机场，...,zh-cn,"Very comfortable and quiet house, bed windows,..."
84378,4577542,44747593,2015-08-29,25805389,Sebastian,"Die Unterkunft war sehr schön und genau so, wi...",de,The accommodation was very nice and exactly as...


In [480]:
# Append newly translated reviews to the other reviews that were already English

english_comments = reviews[reviews.language == 'en']
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
reviews_en = pd.concat([english_comments, translate_me])

In [481]:
reviews_en

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,en_comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...,en,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...,en,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb...",en,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...,en,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...,en,Kelly was a great host and very accommodating ...
...,...,...,...,...,...,...,...,...
84121,6079131,46598039,2015-09-12,28649034,Martha Eugenia,"Fue agradable, los anfitriones preparan tu est...",es,"It was nice, the hosts prepare your stay with ..."
84217,264829,28179087,2015-03-19,28485297,Gele,nice place it stadies like home,af,nice place it stadies like home
84338,7619060,52305251,2015-10-27,20806545,Kevin,非常舒适安静的房子，床边有窗户，晚上睡觉可以看见星星月亮，房东非常热情，退房的时候我去机场，...,zh-cn,"Very comfortable and quiet house, bed windows,..."
84378,4577542,44747593,2015-08-29,25805389,Sebastian,"Die Unterkunft war sehr schön und genau so, wi...",de,The accommodation was very nice and exactly as...


In [None]:
# Preprocessing steps
# Removing accented characters
# Removing Special Characters
# Stemming
# Removing Stopwords
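
The preprocessing steps listed above are not implemented in this notebook. As a rough sketch of what they could look like using only the standard library: `remove_accents`, `clean_comment` and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, and stemming is left out because it needs a stemmer such as the ones in nltk.

```python
import re
import unicodedata

# Tiny illustrative stopword list; a real run would use a fuller list
STOPWORDS = {"the", "a", "an", "and", "was", "is", "to", "in", "of", "it"}

def remove_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def clean_comment(text):
    """Lowercase, strip accents and special characters, then drop stopwords."""
    text = remove_accents(text.lower())
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace special characters with spaces
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]

print(clean_comment("The café was très cozy, and close to everything!"))
```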

In [546]:
# Score Sentiment with Afinn
af = Afinn()

# compute a sentiment score (polarity) for each English-language comment
reviews_en['sentiment'] = [af.score(comment) for comment in reviews_en.en_comments]
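
To make the scores easier to interpret, here is a minimal sketch of how a word-list scorer of this kind works, plus one possible way to turn a score into a positive/negative label. It is not the Afinn implementation: `MINI_LEXICON` holds made-up valences, and `lexicon_score` and `label` are hypothetical helpers.

```python
import re

# Made-up valences for illustration only; not taken from the AFINN word list
MINI_LEXICON = {"great": 3, "perfect": 3, "clean": 2,
                "nightmare": -2, "dirty": -2, "disturbed": -2}

def lexicon_score(text, lexicon=MINI_LEXICON):
    """Sum the valence of every known word; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return float(sum(lexicon.get(word, 0) for word in words))

def label(score):
    """Map a numeric score to a coarse sentiment label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for comment in ["Great room, perfect location", "A dirty nightmare", "We arrived late"]:
    score = lexicon_score(comment)
    print(comment, "->", score, label(score))
```

Because the score in this sketch is a plain sum, a long review can reach a larger magnitude than a short one with the same tone, which is worth keeping in mind when comparing reviews of different lengths.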




In [547]:
reviews_en[reviews_en['sentiment']<0].sort_values('sentiment')

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,language,en_comments,sentiment
67499,3528627,35373295,2015-06-18,29877821,Jun-Yan,"If you want to get disturbed everyday, have no...",en,"If you want to get disturbed everyday, have no...",-32.0
32404,3291777,14760768,2014-06-25,16881604,Annar,Melissa replied to our request and approved bu...,en,Melissa replied to our request and approved bu...,-25.0
71158,3449059,32507141,2015-05-19,29317777,Jane,"While the location of this place is great, we ...",en,"While the location of this place is great, we ...",-23.0
75564,1775016,21243349,2014-10-13,20072109,Anna,Staying at Robert’s place was a nightmare. At ...,en,Staying at Robert’s place was a nightmare. At ...,-22.0
78759,637710,17123011,2014-08-07,13045575,Kenneth,Clean only by the standards of a party or frat...,en,Clean only by the standards of a party or frat...,-20.0
...,...,...,...,...,...,...,...,...,...
45341,3890990,24790268,2015-01-02,21711664,Simon,This was a nice little house in Greenwood. The...,en,This was a nice little house in Greenwood. The...,-1.0
46236,557126,35065503,2015-06-15,27976281,Lauren,We met Chad who was very friendly and welcomin...,en,We met Chad who was very friendly and welcomin...,-1.0
46554,241032,35307619,2015-06-17,32540568,Leonardo,Maija - was very attentive and helped be flex...,en,Maija - was very attentive and helped be flex...,-1.0
40082,1163345,20100461,2014-09-23,17256405,Mia,First thing that needs to be noted is that the...,en,First thing that needs to be noted is that the...,-1.0


In [554]:
#Import listings dataset to merge descriptive columns for the listings (location, neighbourhood.. etc)
listings = pd.read_csv(r"C:\Users\Adetomiwa\Desktop\DATA SCIENCE DEGREE\Python\Project 4\seattle\listings.csv")

In [555]:
listings.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,...,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",...,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,...,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,...,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,...,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


In [570]:
# Choose the columns from the listing dataset to be joined to the reviews data
add_cols = listings[['street','neighbourhood_cleansed','latitude', 'longitude', 'property_type']]

#merge columns to create a new dataframe
reviews_final = pd.merge(reviews_en, add_cols, 
         left_on = reviews_en['listing_id'], 
         right_on = listings['id'], 
         how='left').drop(['key_0','id','reviewer_id','reviewer_name','comments'], axis =1)

In [577]:
reviews_final.head()

Unnamed: 0,listing_id,date,language,en_comments,sentiment,street,neighbourhood_cleansed,latitude,longitude,property_type
0,7202016,2015-07-19,en,Cute and cozy place. Perfect location to every...,5.0,"3rd Avenue West, Seattle, WA 98119, United States",Lower Queen Anne,47.62621,-122.360147,Apartment
1,7202016,2015-07-20,en,Kelly has a great room in a very central locat...,20.0,"3rd Avenue West, Seattle, WA 98119, United States",Lower Queen Anne,47.62621,-122.360147,Apartment
2,7202016,2015-07-26,en,"Very spacious apartment, and in a great neighb...",8.0,"3rd Avenue West, Seattle, WA 98119, United States",Lower Queen Anne,47.62621,-122.360147,Apartment
3,7202016,2015-08-02,en,Close to Seattle Center and all it has to offe...,3.0,"3rd Avenue West, Seattle, WA 98119, United States",Lower Queen Anne,47.62621,-122.360147,Apartment
4,7202016,2015-08-10,en,Kelly was a great host and very accommodating ...,18.0,"3rd Avenue West, Seattle, WA 98119, United States",Lower Queen Anne,47.62621,-122.360147,Apartment


In [1]:
# Check to ensure no blank values
reviews_final.info()


In [575]:
#Save processed data into a csv file
reviews_final.to_csv("reviews_final.csv")
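
Part 1 sets out to look at how sentiment changed over time in different neighbourhoods. One possible way to summarise that from the columns saved above is a monthly average per neighbourhood. The sketch below uses a few made-up rows in place of `reviews_final`; the `toy`, `month` and `monthly` names are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for reviews_final (same column names as above)
toy = pd.DataFrame({
    "date": ["2015-07-19", "2015-07-26", "2015-08-02", "2015-08-10"],
    "neighbourhood_cleansed": ["Broadway", "Broadway", "Belltown", "Belltown"],
    "sentiment": [5.0, -1.0, 3.0, 7.0],
})

# Bucket each review into a year-month string such as "2015-07"
toy["month"] = pd.to_datetime(toy["date"]).dt.strftime("%Y-%m")

# Mean sentiment per month and neighbourhood, one column per neighbourhood
monthly = (
    toy.groupby(["month", "neighbourhood_cleansed"])["sentiment"]
       .mean()
       .unstack("neighbourhood_cleansed")
)
print(monthly)
```

With one column per neighbourhood, `monthly.plot()` would draw one line per neighbourhood over time.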

# Project 4: Exploring the Seattle Airbnb Dataset (Part 2 - Neighbourhood Vibe)

Part 2 of the project uses topic modelling to try to understand the vibe of some of the top neighbourhoods.


In [564]:
# Read in Listings
listings = pd.read_csv(r"C:\Users\Adetomiwa\Desktop\DATA SCIENCE DEGREE\Python\Project 4\seattle\listings.csv")

In [565]:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 92 columns):
id                                  3818 non-null int64
listing_url                         3818 non-null object
scrape_id                           3818 non-null int64
last_scraped                        3818 non-null object
name                                3818 non-null object
summary                             3641 non-null object
space                               3249 non-null object
description                         3818 non-null object
experiences_offered                 3818 non-null object
neighborhood_overview               2786 non-null object
notes                               2212 non-null object
transit                             2884 non-null object
thumbnail_url                       3498 non-null object
medium_url                          3498 non-null object
picture_url                         3818 non-null object
xl_picture_url                      3498

In [566]:
# We'll select the cleansed neighbourhood and neighbourhood overview columns
subset = ["neighbourhood_cleansed","neighborhood_overview"]
listings_new = listings[subset]

In [567]:
listings_new.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 2 columns):
neighbourhood_cleansed    3818 non-null object
neighborhood_overview     2786 non-null object
dtypes: object(2)
memory usage: 59.8+ KB


In [568]:
# Filter out the null values
listings_new = listings_new[listings_new['neighborhood_overview'].notnull()]

In [569]:
# A glance at value counts for each neighbourhood
listings_new['neighbourhood_cleansed'].value_counts()

Broadway             266
Belltown             146
Wallingford          146
Fremont              122
Minor                108
                    ... 
Arbor Heights          4
Holly Park             3
South Beacon Hill      3
Roxhill                2
South Park             1
Name: neighbourhood_cleansed, Length: 87, dtype: int64

**NOTES**

We'll view topics in the 3 neighbourhoods with the most listings that have an overview:

- Broadway
- Belltown
- Wallingford

In [744]:
broadway = listings_new.loc[listings_new['neighbourhood_cleansed'] == 'Broadway']
broadway

Unnamed: 0,neighbourhood_cleansed,neighborhood_overview
2577,Broadway,This location sits in the middle between Capit...
2579,Broadway,The flat is smack dab in the middle of Capitol...
2582,Broadway,We love everything about this (URL HIDDEN) is ...
2584,Broadway,"Capitol Hill is fun for any age, there are man..."
2586,Broadway,"Capitol Hill is one of the fastest growing, ec..."
...,...,...
2964,Broadway,Situated on the eastern bench of downtown our ...
2966,Broadway,Capitol Hill is the best of both worlds--quiet...
2968,Broadway,I live right on 15th avenue where there are so...
2969,Broadway,Capitol Hill is one of the best Seattle neighb...


In [745]:
broadway = broadway.drop('neighbourhood_cleansed', axis =1)

In [746]:
exclude = ['capitol','hill', 'seattle', 'neighbourhood', 'neighborhood','walk','walking']
broadway["neighborhood_overview"] = broadway["neighborhood_overview"].str.lower().str.split().apply(lambda x: [item for item in x if item not in exclude])

In [747]:
broadway

Unnamed: 0,neighborhood_overview
2577,"[this, location, sits, in, the, middle, betwee..."
2579,"[the, flat, is, smack, dab, in, the, middle, o..."
2582,"[we, love, everything, about, this, (url, hidd..."
2584,"[is, fun, for, any, age,, there, are, many, ac..."
2586,"[is, one, of, the, fastest, growing,, eclectic..."
...,...
2964,"[situated, on, the, eastern, bench, of, downto..."
2966,"[is, the, best, of, both, worlds--quiet, resid..."
2968,"[i, live, right, on, 15th, avenue, where, ther..."
2969,"[is, one, of, the, best, neighborhoods!, it, i..."


In [749]:
broadway.to_csv("broadway.csv")

In [769]:
belltown = listings_new.loc[listings_new['neighbourhood_cleansed'] == 'Belltown']
belltown = belltown.drop('neighbourhood_cleansed', axis =1)
belltown["neighborhood_overview"] = belltown["neighborhood_overview"].str.lower().str.split().apply(lambda x: [item for item in x if item not in exclude])
belltown.to_csv("belltown.csv")

In [770]:
wallingford = listings_new.loc[listings_new['neighbourhood_cleansed'] == 'Wallingford']
wallingford = wallingford.drop('neighbourhood_cleansed', axis =1)
wallingford["neighborhood_overview"] = wallingford["neighborhood_overview"].str.lower().str.split().apply(lambda x: [item for item in x if item not in exclude])
wallingford.to_csv("Wallingford.csv")

In [590]:
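# Function to tokenize, remove stopwords and stem

from gensim.parsing.preprocessing import preprocess_string

def process_text(df):
    # preprocess_string applies gensim's default filters (lowercasing, punctuation
    # and stopword removal, stemming, ...) and returns a list of tokens
    df['text'] = df['neighborhood_overview'].apply(lambda x: preprocess_string(str(x)))
    return df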

In [591]:
process_text(broadway)

Unnamed: 0,neighborhood_overview,text
2577,"[this, location, sits, in, the, middle, betwee...","[locat, sit, middl, lake, union, area, heart, ..."
2579,"[the, flat, is, smack, dab, in, the, middle, o...","[flat, smack, dab, middl, hill, cal, anderson,..."
2582,"[we, love, everything, about, this, (url, hidd...","[love, url, hidden, eclect, need, distanc]"
2584,"[is, fun, for, any, age,, there, are, many, ac...","[fun, ag, activ, coffe, shop, park, bar, resta..."
2586,"[is, one, of, the, fastest, growing,, eclectic...","[fastest, grow, eclect, access, neighborhood, ..."
...,...,...
2964,"[situated, on, the, eastern, bench, of, downto...","[situat, eastern, bench, downtown, amaz, centr..."
2966,"[is, the, best, of, both, worlds--quiet, resid...","[best, world, quiet, residenti, street, featur..."
2968,"[i, live, right, on, 15th, avenue, where, ther...","[live, right, avenu, restaur, bar, coffe, shop..."
2969,"[is, one, of, the, best, neighborhoods!, it, i...","[best, neighborhood, fill, local, cultur, grea..."


In [592]:
result_broad = process_text(broadway).drop('neighborhood_overview',axis =1)

In [593]:
result_broad

Unnamed: 0,text
2577,"[locat, sit, middl, lake, union, area, heart, ..."
2579,"[flat, smack, dab, middl, hill, cal, anderson,..."
2582,"[love, url, hidden, eclect, need, distanc]"
2584,"[fun, ag, activ, coffe, shop, park, bar, resta..."
2586,"[fastest, grow, eclect, access, neighborhood, ..."
...,...
2964,"[situat, eastern, bench, downtown, amaz, centr..."
2966,"[best, world, quiet, residenti, street, featur..."
2968,"[live, right, avenu, restaur, bar, coffe, shop..."
2969,"[best, neighborhood, fill, local, cultur, grea..."


In [735]:
df = pd.Series(sum([item for item in result_broad.text], [])).value_counts().head(10)
df

restaur     189
park        156
bar         152
block       146
downtown    127
shop        126
seattl      118
locat       109
awai        107
minut       102
dtype: int64

In [595]:
from gensim.corpora.dictionary import Dictionary
from gensim.models import HdpModel

broadway_dict = Dictionary(result_broad['text'])

In [596]:
broadway_dict.token2id

{'anderson': 0,
 'area': 1,
 'attract': 2,
 'bring': 3,
 'broadwai': 4,
 'busi': 5,
 'cal': 6,
 'children': 7,
 'citi': 8,
 'cloth': 9,
 'coffeehous': 10,
 'commun': 11,
 'compani': 12,
 'corridor': 13,
 'cut': 14,
 'dip': 15,
 'divers': 16,
 'dog': 17,
 'downtown': 18,
 'easi': 19,
 'edg': 20,
 'exist': 21,
 'farmer': 22,
 'fastest': 23,
 'feel': 24,
 'fiesta': 25,
 'fit': 26,
 'flower': 27,
 'food': 28,
 'fountain': 29,
 'frisbe': 30,
 'furnitur': 31,
 'group': 32,
 'grow': 33,
 'heart': 34,
 'highlight': 35,
 'hip': 36,
 'home': 37,
 'huge': 38,
 'includ': 39,
 'lake': 40,
 'lakefront': 41,
 'life': 42,
 'like': 43,
 'locat': 44,
 'market': 45,
 'middl': 46,
 'music': 47,
 'neighborhood': 48,
 'park': 49,
 'peopl': 50,
 'pike': 51,
 'pine': 52,
 'place': 53,
 'player': 54,
 'season': 55,
 'seattl': 56,
 'shop': 57,
 'sit': 58,
 'small': 59,
 'state': 60,
 'stereotyp': 61,
 'store': 62,
 'stream': 63,
 'toe': 64,
 'town': 65,
 'union': 66,
 'urba': 67,
 'urban': 68,
 'varieti': 69,
 

In [762]:
corpus_broad = [broadway_dict.doc2bow(text) for text in result_broad['text']]
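
For readers new to the bag-of-words format, the sketch below shows the general idea behind a token-to-id dictionary and a list of `(token_id, count)` pairs per document. It illustrates the concept only and is not gensim's implementation: `build_token2id` and `to_bow` are hypothetical helpers, and the ids gensim assigns may follow a different order.

```python
from collections import Counter

def build_token2id(docs):
    """Assign an integer id to every distinct token (first-seen order in this sketch)."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs, ignoring tokens without an id."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

docs = [["park", "bar", "park"], ["coffe", "shop", "bar"]]
token2id = build_token2id(docs)
corpus = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(corpus)
```

Each document becomes a sparse vector: only the tokens that occur are listed, together with how often they occur.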

In [765]:
hdp = HdpModel(corpus_broad, id2word = broadway_dict)

In [766]:
topic_info = hdp.print_topics(num_topics=3, num_words=5)
topic_info

[(0, '0.012*minut + 0.008*park + 0.008*restaur + 0.006*fantast + 0.006*walk'),
 (1, '0.007*locat + 0.006*park + 0.005*serv + 0.005*minut + 0.005*outdoor'),
 (2, '0.009*park + 0.009*mile + 0.007*block + 0.006*joe + 0.005*high')]

In [767]:
def find_topics(location):
    '''
    Return the top HDP topics for the neighbourhood overviews of one neighbourhood
    location - a value from the 'neighbourhood_cleansed' column, e.g. 'Belltown'
    '''
    # keep only the listings in the requested neighbourhood
    area = listings_new.loc[listings_new['neighbourhood_cleansed'] == location]
    area = area.drop('neighbourhood_cleansed', axis =1)
    
    exclude = ['capitol','hill', 'seattle', 'neighborhood','walk','walking']
    area["neighborhood_overview"] = area["neighborhood_overview"].str.lower().str.split().apply(lambda x: [item for item in x if item not in exclude])
    
    area['text'] = area['neighborhood_overview'].apply(lambda x: preprocess_string(str(x)))
    area = area.drop('neighborhood_overview',axis =1)

    area_dict = Dictionary(area['text'])

    corpus_area = [area_dict.doc2bow(text) for text in area['text']]
    hdp = HdpModel(corpus_area, id2word = area_dict)

    topic_info = hdp.print_topics(num_topics=3, num_words=5)
    
    return topic_info

In [None]:
find_topics('Belltown')

In [611]:
#Wallingford Test
wall = listings_new.loc[listings_new['neighbourhood_cleansed'] == 'Wallingford']

In [614]:
wall = wall.drop('neighbourhood_cleansed', axis =1)

In [615]:
exclude = ['capitol','hill', 'seattle', 'neighborhood','walk','walking','wallingford','bellfront']
wall["neighborhood_overview"] = wall["neighborhood_overview"].str.lower().str.split().apply(lambda x: [item for item in x if item not in exclude])

In [616]:
wall['text'] = wall['neighborhood_overview'].apply(lambda x: preprocess_string(str(x)))
wall = wall.drop('neighborhood_overview',axis =1)

In [617]:
wall_dict = Dictionary(wall['text'])

In [756]:
corpus_wall = [wall_dict.doc2bow(text) for text in wall['text']]
hdp = HdpModel(corpus_wall, id2word = wall_dict)
topic_info_wall = hdp.show_topics(num_topics=10, num_words=5, formatted=False)

In [757]:
topic_info_wall

[(0,
  [('recreat', 0.007866457416722763),
   ('amen', 0.007420905391991037),
   ('right', 0.006676928894448375),
   ('court', 0.0064210846153119495),
   ('cake', 0.005397969748357705)]),
 (1,
  [('sampl', 0.0070579961781212855),
   ('wok', 0.005911660112992227),
   ('flower', 0.005674174907749075),
   ('gem', 0.005489436426632409),
   ('archi', 0.005465774884714168)]),
 (2,
  [('burn', 0.00718435170616478),
   ('special', 0.005791271737346041),
   ('ag', 0.005575291944363793),
   ('hundr', 0.005130521233112243),
   ('destin', 0.005115991727813704)]),
 (3,
  [('gyro', 0.009713376639321867),
   ('dedic', 0.007401664587675976),
   ('concert', 0.006808882502899116),
   ('glide', 0.006025024286411992),
   ('begin', 0.005476754543783745)]),
 (4,
  [('sushi', 0.006682775525175087),
   ('destin', 0.005902370925294287),
   ('main', 0.005627050643550498),
   ('improv', 0.00548506879872631),
   ('countless', 0.005126100805166888)]),
 (5,
  [('molli', 0.006265550138497504),
   ('easili', 0.006250

In [672]:
from gensim.models.coherencemodel import CoherenceModel

In [673]:
topics = []
for topic_id, topic in topic_info_wall:
    topic = [word for word, _ in topic]
    topics.append(topic)

In [674]:
topics

[['take', 'beer', 'look', 'row', 'ton'],
 ['season', 'serv', 'hope', 'instanc', 'concert'],
 ['pub', 'promis', 'exercis', 'item', 'vietnames'],
 ['roaster', 'awai', 'decommiss', 'unit', 'nester'],
 ['definit', 'grab', 'log', 'hiram', 'archi'],
 ['short', 'cottag', 'futur', 'myriad', 'apart'],
 ['circl', 'action', 'book', 'stellar', 'claim'],
 ['hr', 'wok', 'northgat', 'todai', 'yacht'],
 ['fremont', 'restaur', 'view', 'ethnic', 'nearbi'],
 ['kabul', 'machineri', 'leav', 'major', 'cleaner']]

In [675]:
cm = CoherenceModel(topics=topics, corpus=corpus_wall, dictionary=wall_dict, coherence='u_mass')

In [755]:
cm.get_coherence()

-20.546251894739658
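
To give some intuition for a score this negative, here is a rough sketch of a UMass-style coherence calculation based on document co-occurrence counts. It is written from the general definition of the measure and is not gensim's code: the `umass_coherence` name, the smoothing constant `eps`, the averaging over word pairs and the toy documents are all assumptions.

```python
import math

def umass_coherence(topic_words, docs, eps=1e-12):
    """Average log((D(w_i, w_j) + eps) / D(w_j)) over ordered word pairs.

    D(...) counts the documents that contain all the given words.
    Pairs whose conditioning word never occurs are skipped (an assumption).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_count(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            w_i, w_j = topic_words[i], topic_words[j]
            denom = doc_count(w_j)
            if denom == 0:
                continue
            scores.append(math.log((doc_count(w_i, w_j) + eps) / denom))
    return sum(scores) / len(scores) if scores else 0.0

toy_docs = [["park", "bar"], ["park", "shop"], ["bar", "park"], ["yacht"]]
print(umass_coherence(["park", "bar"], toy_docs))    # words that co-occur
print(umass_coherence(["park", "yacht"], toy_docs))  # words that never co-occur
```

In this sketch a pair of words that never appears in the same document contributes roughly `log(eps)`, so an average around -20 would suggest that many of the top words in the topics rarely or never co-occur in the overviews.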

In [723]:
num_topics = [2, 3, 4, 5]
num_words = [3, 4, 5]
# the corpus does not depend on the loop variables, so build it once
corpus_wall = [wall_dict.doc2bow(text) for text in wall['text']]
for i in num_topics:
    for x in num_words:
        # a new HDP model is trained for every combination, so part of the
        # variation in the scores comes from the random initialisation
        hdp = HdpModel(corpus_wall, id2word = wall_dict)
        topic_info_wall = hdp.show_topics(num_topics= i, num_words=x, formatted=False)

        # keep only the words of each topic, dropping their weights
        topics = [[word for word, _ in topic] for topic_id, topic in topic_info_wall]

        cm = CoherenceModel(topics=topics, corpus=corpus_wall, dictionary=wall_dict, coherence='u_mass')
        score = cm.get_coherence()

        print(i,'topics', x, 'words', 'have a coherence score of', score)

2 topics 3 words have a coherence score of -23.414942858551562
2 topics 4 words have a coherence score of -20.06707262266789
2 topics 5 words have a coherence score of -20.294395863426303
3 topics 3 words have a coherence score of -20.77967182230907
3 topics 4 words have a coherence score of -20.142605544482212
3 topics 5 words have a coherence score of -21.18353017017243
4 topics 3 words have a coherence score of -17.690241578099613
4 topics 4 words have a coherence score of -23.375395168687174
4 topics 5 words have a coherence score of -20.30374140570619
5 topics 3 words have a coherence score of -21.892648258613924
5 topics 4 words have a coherence score of -17.95755941397205
5 topics 5 words have a coherence score of -20.546251894739658
