## The Most Frequent Words in Reviews

This Jupyter notebook analyses which words reviewers use most frequently to describe their experience in Airbnb lodgings and homestays. We focus on the nouns, adjectives and adverbs that reviewers use to describe a listing. The analysis uses the reviews.csv dataset.

### Read the Data

In [1]:
import nltk
import pandas as pd
import numpy as np
#nltk.download('punkt')
#nltk.download('stopwords')
#nltk.download('averaged_perceptron_tagger')
data_reviews=pd.read_csv('reviews.csv')
data_reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...


### Data Processing

In [2]:
#tokenize reviews 
data_reviews['comments_token']=data_reviews['comments'].astype('str').map(lambda comment:nltk.word_tokenize(comment))
data_reviews.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,comments_token
0,7202016,38917982,2015-07-19,28943674,Bianca,Cute and cozy place. Perfect location to every...,"[Cute, and, cozy, place, ., Perfect, location,..."
1,7202016,39087409,2015-07-20,32440555,Frank,Kelly has a great room in a very central locat...,"[Kelly, has, a, great, room, in, a, very, cent..."
2,7202016,39820030,2015-07-26,37722850,Ian,"Very spacious apartment, and in a great neighb...","[Very, spacious, apartment, ,, and, in, a, gre..."
3,7202016,40813543,2015-08-02,33671805,George,Close to Seattle Center and all it has to offe...,"[Close, to, Seattle, Center, and, all, it, has..."
4,7202016,41986501,2015-08-10,34959538,Ming,Kelly was a great host and very accommodating ...,"[Kelly, was, a, great, host, and, very, accomm..."


In [3]:
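#remove stop words and non-alphabetic tokens
#the stopwords file ID is lowercase 'english'; 'English' fails on case-sensitive file systems
stops=set(nltk.corpus.stopwords.words('english'))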
data_reviews['comments_text']=data_reviews['comments_token'].map(lambda comments_token:[word for word in comments_token if word.isalpha() and word not in stops])
data_reviews[['listing_id','comments_token','comments_text']].head()

Unnamed: 0,listing_id,comments_token,comments_text
0,7202016,"[Cute, and, cozy, place, ., Perfect, location,...","[Cute, cozy, place, Perfect, location, everyth..."
1,7202016,"[Kelly, has, a, great, room, in, a, very, cent...","[Kelly, great, room, central, location, Beauti..."
2,7202016,"[Very, spacious, apartment, ,, and, in, a, gre...","[Very, spacious, apartment, great, neighborhoo..."
3,7202016,"[Close, to, Seattle, Center, and, all, it, has...","[Close, Seattle, Center, offer, ballet, theate..."
4,7202016,"[Kelly, was, a, great, host, and, very, accomm...","[Kelly, great, host, accommodating, great, nei..."
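
The output above still contains capitalised function words such as 'Very', because the NLTK stop word list is lowercase and the membership test is case-sensitive. A minimal sketch of a case-insensitive filter follows. The small `STOPS` set and the `filter_tokens` helper are hypothetical stand-ins for the NLTK list and the lambda used in the cell, so that the example runs without NLTK.

```python
# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english')).
STOPS = {"and", "a", "in", "very", "the", "to", "has"}

def filter_tokens(tokens, stops=STOPS):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalpha() and t.lower() not in stops]

tokens = ["Very", "spacious", "apartment", ",", "and", "in", "a", "great", "neighborhood"]
filtered = filter_tokens(tokens)
print(filtered)
```

Lowercasing only for the comparison keeps the original capitalisation of the surviving tokens, which the part-of-speech tagger may rely on to recognise proper nouns.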


In [4]:
#aggregate the reviews of each listing_id
listing=set(data_reviews['listing_id'])
comments_text_new=[]
listing_id_new=[]
for l in listing:
    #print (l)
    listing_id_new.append(l)
    comment=[]
    listing_data=data_reviews[data_reviews.listing_id==l].copy()
    for c in range(len(listing_data)):
        comment=comment+listing_data['comments_text'].iloc[c]
        
    comments_text_new.append(comment)
    #print (comments_text_new)
print ('words to describe listing {}'.format(listing_id_new[0]))
print (comments_text_new[0])
#print (listing_id_new[0])


words to describe listing 2727938
['Jessica', 'place', 'delightfully', 'cozy', 'close', 'many', 'awesome', 'sights', 'bars', 'restaurants', 'Her', 'apartment', 'feels', 'homey', 'relaxing', 'perfect', 'fit', 'stays', 'Jessica', 'easy', 'communicate', 'provided', 'everything', 'needed', 'stay', 'Overall', 'great', 'experience', 'I', 'would', 'definitely', 'stay', 'Loved', 'Loved', 'Loved', 'place', 'It', 'totally', 'felt', 'like', 'I', 'coming', 'home', 'apartment', 'Had', 'amazing', 'character', 'I', 'feel', 'like', 'location', 'PERFECT', 'Not', 'far', 'anything', 'Close', 'bus', 'lines', 'get', 'around', 'really', 'safe', 'area', 'Would', 'totally', 'stay', 'Jessica', 'apartment', 'great', 'It', 'quiet', 'street', 'still', 'close', 'enough', 'walk', 'everything', 'Awesome', 'view', 'We', 'really', 'great', 'time', 'Seattle', 'part', 'Jessica', 'place', 'The', 'location', 'really', 'convenient', 'place', 'spacious', 'studio', 'Jessica', 'great', 'host', 'Super', 'helpful', 'getting', 'us
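
The loop above filters the whole DataFrame once per listing and rebuilds the token list with repeated concatenation. As a hedged alternative, the same per-listing lists can be built in a single pass with a dictionary. The rows below are made up for illustration and only mirror the `listing_id` and `comments_text` columns; `aggregate_tokens` is a hypothetical helper, not part of the notebook.

```python
from collections import defaultdict

# Made-up (listing_id, tokens) pairs mirroring the two columns used above.
rows = [
    (101, ["Cute", "cozy", "place"]),
    (202, ["spacious", "studio"]),
    (101, ["great", "room"]),
]

def aggregate_tokens(pairs):
    """Concatenate the token lists of each listing in a single pass."""
    by_listing = defaultdict(list)
    for listing_id, tokens in pairs:
        by_listing[listing_id].extend(tokens)
    return dict(by_listing)

aggregated = aggregate_tokens(rows)
print(aggregated[101])
```

In the notebook, the pairs could come from `zip(data_reviews['listing_id'], data_reviews['comments_text'])`.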

In [5]:
#new dataset data_reviews_agg_df to store aggregated reviews 
data_reviews_agg_df=pd.DataFrame(listing_id_new,columns=['listing_id_new'])
data_reviews_agg_df.insert(loc=data_reviews_agg_df.shape[1],column='comments_text_new',value=comments_text_new)

In [6]:
data_reviews_agg_df.head()

Unnamed: 0,listing_id_new,comments_text_new
0,2727938,"[Jessica, place, delightfully, cozy, close, ma..."
1,598023,"[Jeff, friendly, host, We, chance, go, togethe..."
2,794633,"[Great, space, good, location, gracious, host,..."
3,958475,"[When, arrived, Annex, everything, arranged, l..."
4,7381005,"[Magic, Jeff, place, incredible, location, per..."


In [7]:
#part-of-speech tagging for each word
data_reviews_agg_df['comments_pos']=data_reviews_agg_df['comments_text_new'].map(lambda comment:nltk.pos_tag(comment))
data_reviews_agg_df.head()

Unnamed: 0,listing_id_new,comments_text_new,comments_pos
0,2727938,"[Jessica, place, delightfully, cozy, close, ma...","[(Jessica, NNP), (place, NN), (delightfully, R..."
1,598023,"[Jeff, friendly, host, We, chance, go, togethe...","[(Jeff, NNP), (friendly, RB), (host, VBD), (We..."
2,794633,"[Great, space, good, location, gracious, host,...","[(Great, JJ), (space, NN), (good, JJ), (locati..."
3,958475,"[When, arrived, Annex, everything, arranged, l...","[(When, WRB), (arrived, VBN), (Annex, NNP), (e..."
4,7381005,"[Magic, Jeff, place, incredible, location, per...","[(Magic, NNP), (Jeff, NNP), (place, NN), (incr..."


We are only interested in nouns ('NN', 'NNS', 'NNP', 'NNPS'), adjectives ('JJ', 'JJR', 'JJS') and adverbs ('RB', 'RBR', 'RBS'). First, we count the nouns, adjectives and adverbs for the first listing_id. The most frequent words are 'great' (19 times), 'Jessica' (16 times), 'apartment' (11 times), and so on.

In [8]:
tags=set(['JJ','JJR','JJS','NN','NNS','NNP','NNPS','RB','RBR','RBS'])
cfd=nltk.ConditionalFreqDist()
for word,tag in data_reviews_agg_df['comments_pos'].iloc[0]:
    if tag in tags:
        cfd['general'][word]+=1
        if tag in ['JJ','JJR','JJS']:
            cfd['adj'][word]+=1
        elif tag in ['NN','NNS','NNP','NNPS']:
            cfd['noun'][word]+=1
        elif tag in ['RB','RBR','RBS']:
            cfd['adv'][word]+=1
    
cfd['general'].tabulate()

          great         Jessica       apartment           place            easy           close        location      everything          really            host    neighborhood            many            time         Seattle        distance           space            bars         perfect            stay      definitely           Loved            area            walk            view         Capitol            Hill             art           walls         weekend     restaurants          little        probably         windows            nice             lot            lots         awesome     communicate         Overall      experience         totally             far          enough            part      convenient             min        downtown           quick         jessica           stars           visit          minute           noise            open             bad           super           clean         exactly           whole          plants            well      absolutely         

Now we calculate the word frequencies for every listing in data_reviews_agg_df.

In [9]:
def conditionfrequency(comments_pos):
    tags=set(['JJ','JJR','JJS','NN','NNS','NNP','NNPS','RB','RBR','RBS'])
    cfd=nltk.ConditionalFreqDist()
    for word,tag in comments_pos:
        if tag in tags:
            cfd['general'][word]+=1
            if tag in ['JJ','JJR','JJS']:
                cfd['adj'][word]+=1
            elif tag in ['NN','NNS','NNP','NNPS']:
                cfd['noun'][word]+=1
            elif tag in ['RB','RBR','RBS']:
                cfd['adv'][word]+=1
    return cfd
data_reviews_agg_df['conditionfrequency']=data_reviews_agg_df['comments_pos'].map(conditionfrequency)
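
The if/elif chain can also be written as a lookup table from Penn Treebank tag to category. The sketch below uses `collections.Counter` in place of `nltk.ConditionalFreqDist` so that it runs without NLTK. `TAG_TO_CATEGORY` and `count_by_category` are illustrative names, and the tagged words are hand-written rather than tagger output.

```python
from collections import Counter, defaultdict

# Map each tag of interest to its category; all other tags are ignored.
TAG_TO_CATEGORY = {
    "JJ": "adj", "JJR": "adj", "JJS": "adj",
    "NN": "noun", "NNS": "noun", "NNP": "noun", "NNPS": "noun",
    "RB": "adv", "RBR": "adv", "RBS": "adv",
}

def count_by_category(tagged_words):
    """Count words per category, plus a combined 'general' count."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        category = TAG_TO_CATEGORY.get(tag)
        if category is None:
            continue  # verbs, pronouns, etc.
        counts["general"][word] += 1
        counts[category][word] += 1
    return counts

# Hand-written (word, tag) pairs for illustration only.
tagged = [("Jessica", "NNP"), ("place", "NN"), ("delightfully", "RB"),
          ("great", "JJ"), ("feels", "VBZ"), ("place", "NN")]
counts = count_by_category(tagged)
print(counts["noun"])
```

Keeping the tags in one table means a new category only needs new entries in `TAG_TO_CATEGORY`, not another branch.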


In addition to separate counts for nouns, adjectives and adverbs, we keep a 'general' count that combines all three categories.

In [10]:
data_reviews_agg_df.iloc[0].conditionfrequency.conditions()

['general', 'noun', 'adv', 'adj']

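The five words listed for the first listing_id are as follows. Because `items()` iterates in insertion order, these are the first five words encountered rather than the five most frequent ('great', the most frequent word, does not appear).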

In [11]:
t=data_reviews_agg_df.iloc[0]['conditionfrequency']['general'].items()
topword=[key for key,value in t ]
print ('The top 5 words to describe the first listing_id are', topword[:5])

The top 5 words to describe the first listing_id are ['Jessica', 'place', 'delightfully', 'close', 'many']
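
A frequency-ranked list can be obtained with `most_common`, which NLTK 3's `FreqDist` inherits from `collections.Counter`, whereas slicing the keys of `items()` follows insertion order. The sketch below uses a plain `Counter` with made-up words to show the difference.

```python
from collections import Counter

# Made-up tokens: 'great' is the most frequent but is seen last.
words = ["Jessica", "place", "great", "great", "Jessica", "great"]
freq = Counter(words)

# Slicing items() gives the first words encountered.
first_seen = [w for w, _ in freq.items()][:2]

# most_common(n) gives the n words with the highest counts.
most_frequent = [w for w, _ in freq.most_common(2)]

print(first_seen)
print(most_frequent)
```

Under this assumption, replacing `.items()` with `.most_common(5)` in the cell above would rank the words by count.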


In [12]:
t=data_reviews_agg_df.iloc[0]['conditionfrequency']['noun'].items()
topword=[key for key,value in t ]
print ('The top 5 nouns to describe the first listing_id are', topword[:5])

The top 5 nouns to describe the first listing_id are ['Jessica', 'place', 'sights', 'bars', 'apartment']


In [13]:
t=data_reviews_agg_df.iloc[0]['conditionfrequency']['adj'].items()
topword=[key for key,value in t ]
print ('The top 5 adjectives to describe the first listing_id are', topword[:5])

The top 5 adjectives to describe the first listing_id are ['close', 'many', 'awesome', 'perfect', 'fit']


In [14]:
t=data_reviews_agg_df.iloc[0]['conditionfrequency']['adv'].items()
topword=[key for key,value in t ]
print ('The top 5 adverbs to describe the first listing_id are', topword[:5])

The top 5 adverbs to describe the first listing_id are ['delightfully', 'definitely', 'totally', 'home', 'Not']


We build the same five-word lists for each category and every listing, and store them in data_reviews_agg_df.

In [15]:
#collect five words per category: general, noun, adjective, adverb
#(items() keeps insertion order; most_common(5) would rank by frequency)
topword_general=[]
topword_adj=[]
topword_adv=[]
topword_noun=[]
for key,value in data_reviews_agg_df['conditionfrequency'].items():
    t1=value['general'].items()
    words1=[k for k,v in t1]
    topword_general.append(words1[:5])
    t2=value['adj'].items()
    words2=[k for k,v in t2]
    topword_adj.append(words2[:5])
    t3=value['adv'].items()
    words3=[k for k,v in t3]
    topword_adv.append(words3[:5])
    t4=value['noun'].items()
    words4=[k for k,v in t4]
    topword_noun.append(words4[:5])
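
The four near-identical blocks in the loop could be folded into one helper that returns the words for every category at once. A sketch, with `Counter` objects standing in for the `FreqDist` values and made-up counts; `top_words` is a hypothetical helper, and unlike the loop above it ranks by frequency via `most_common`.

```python
from collections import Counter

def top_words(freqs, n=5):
    """Return the n most frequent words for each category in freqs."""
    return {
        category: [word for word, _ in counter.most_common(n)]
        for category, counter in freqs.items()
    }

# Made-up per-category counts for a single listing.
freqs = {
    "general": Counter({"host": 2, "great": 3, "really": 1}),
    "adj": Counter({"great": 3}),
    "noun": Counter({"host": 2}),
    "adv": Counter({"really": 1}),
}
top = top_words(freqs, n=2)
print(top["general"])
```

Applied to the `conditionfrequency` column, each returned dictionary could then be expanded into the four `top_*_words` columns.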


In [16]:
data_reviews_agg_df.insert(loc=data_reviews_agg_df.shape[1],column='top_general_words',value=topword_general)
data_reviews_agg_df.insert(loc=data_reviews_agg_df.shape[1],column='top_adj_words',value=topword_adj)
data_reviews_agg_df.insert(loc=data_reviews_agg_df.shape[1],column='top_adv_words',value=topword_adv)
data_reviews_agg_df.insert(loc=data_reviews_agg_df.shape[1],column='top_noun_words',value=topword_noun)
data_reviews_agg_df.head()

Unnamed: 0,listing_id_new,comments_text_new,comments_pos,conditionfrequency,top_general_words,top_adj_words,top_adv_words,top_noun_words
0,2727938,"[Jessica, place, delightfully, cozy, close, ma...","[(Jessica, NNP), (place, NN), (delightfully, R...","{'general': {'Jessica': 16, 'place': 9, 'delig...","[Jessica, place, delightfully, close, many]","[close, many, awesome, perfect, fit]","[delightfully, definitely, totally, home, Not]","[Jessica, place, sights, bars, apartment]"
1,598023,"[Jeff, friendly, host, We, chance, go, togethe...","[(Jeff, NNP), (friendly, RB), (host, VBD), (We...","{'general': {'Jeff': 6, 'friendly': 5, 'chance...","[Jeff, friendly, chance, together, lunch]","[lunch, great, big, gracious, downtown]","[friendly, together, also, lovely, soon]","[Jeff, chance, beach, area, time]"
2,794633,"[Great, space, good, location, gracious, host,...","[(Great, JJ), (space, NN), (good, JJ), (locati...","{'general': {'Great': 46, 'space': 32, 'good':...","[Great, space, good, location, gracious]","[Great, good, gracious, wine, modern]","[Highly, friendly, visit, exactly, Very]","[space, location, host, studio, apartment]"
3,958475,"[When, arrived, Annex, everything, arranged, l...","[(When, WRB), (arrived, VBN), (Annex, NNP), (e...","{'general': {'Annex': 20, 'everything': 31, 'e...","[Annex, everything, entrance, keypad, entry]","[keypad, automatic, first, spacious, array]","[fully, well, impeccably, always, promptly]","[Annex, everything, entrance, entry, light]"
4,7381005,"[Magic, Jeff, place, incredible, location, per...","[(Magic, NNP), (Jeff, NNP), (place, NN), (incr...","{'general': {'Magic': 4, 'Jeff': 5, 'place': 3...","[Magic, Jeff, place, incredible, location]","[incredible, great, clean, stylish, comfortable]","[especially, extremely, even, definitely, really]","[Magic, Jeff, place, location, perfect]"


Now we can easily look up the stored words for any listing_id.

In [17]:
data_reviews_agg_df[data_reviews_agg_df['listing_id_new']==598023][['listing_id_new','comments_text_new','top_general_words','top_adv_words','top_adj_words','top_noun_words']]

Unnamed: 0,listing_id_new,comments_text_new,top_general_words,top_adv_words,top_adj_words,top_noun_words
1,598023,"[Jeff, friendly, host, We, chance, go, togethe...","[Jeff, friendly, chance, together, lunch]","[friendly, together, also, lovely, soon]","[lunch, great, big, gracious, downtown]","[Jeff, chance, beach, area, time]"
