# Goal

- Detect and select language: the review data contain many languages, so for simplicity I keep only the majority language, English, for the later analysis.

In [2]:
import numpy as np
import pandas as pd
from collections import Counter

### Read data created in the previous section (Airbnb_data_merge).

In [3]:
df = pd.read_csv('data/df_new.csv').drop('Unnamed: 0', axis=1)
df.info()  # info() prints its summary directly and returns None
# df.head(2)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143237 entries, 0 to 143236
Data columns (total 58 columns):
id                             143237 non-null int64
review_id                      143237 non-null int64
date                           143237 non-null object
reviewer_id                    143237 non-null int64
reviewer_name                  143237 non-null object
comments                       143201 non-null object
listing_url                    143237 non-null object
name                           143237 non-null object
summary                        117304 non-null object
space                          138401 non-null object
description                    143234 non-null object
neighborhood_overview          100051 non-null object
notes                          84125 non-null object
transit                        105613 non-null object
host_id                        143237 non-null int64
host_url                       143237 non-null object
host_name                     

## Detect language 
- Use [langdetect](https://pypi.python.org/pypi/langdetect) to detect languages in reviews

In [3]:
from langdetect import detect

In [4]:
df['Language'] = None
df.head(1)

Unnamed: 0,id,review_id,date,reviewer_id,reviewer_name,comments,listing_url,name,summary,space,...,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,reviews_per_month,Language
0,1994427,10612780,2014-02-27,4918890,Katie,Great host. Provided special gift basket upon ...,https://www.airbnb.com/rooms/1994427,One Bedroom apartment,Charming one bedroom in the Mission! Close to...,The bedroom fits two on the full sized bed and...,...,9,9,10,10,10,9,f,strict,0.24,


In [6]:
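def run_detect(v):
    """Return the detected language code for v, or None if detection fails."""
    v = str(v)
    try:
        return detect(v)
    except Exception:
        # detect() raised, so the text could not be classified;
        # these rows are treated as mojibake and labeled None.
        return None

def ensure_unicode(v):
    """Decode a Python 2 byte string as UTF-8; return it unchanged on UnicodeEncodeError."""
    v = str(v)
    try:
        return v.decode('utf8')
    except UnicodeEncodeError:
        return v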
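
One case `run_detect` does not separate out is a missing comment: `str(v)` turns a NaN into the literal string `'nan'`, which is then passed to the detector as if it were review text. The following is a minimal sketch, not the code used in this notebook, of a wrapper that skips missing or blank values first. The names `detect_or_none` and `toy_detector` are hypothetical, and the detector is passed in as an argument so the example runs without langdetect installed; in practice one would pass `langdetect.detect`. Unlike the cells in this notebook, the sketch is written for Python 3.

```python
import math


def detect_or_none(value, detector):
    """Return detector(text), or None for missing, blank, or undetectable text.

    `detector` is any callable mapping a string to a language code,
    for example langdetect.detect (assumed here, not imported).
    """
    # pandas represents a missing comment as a float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    text = str(value).strip()
    if not text:
        return None
    try:
        return detector(text)
    except Exception:
        # The detector could not classify the text
        return None


def toy_detector(text):
    """Hypothetical stand-in for a real detector, used only to exercise the wrapper."""
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no usable features in text")
    return "en"


print(detect_or_none("Great host.", toy_detector))
print(detect_or_none(float("nan"), toy_detector))
print(detect_or_none("12345", toy_detector))
```

With this wrapper, missing comments end up with `None` rather than whatever language the string `'nan'` happens to be classified as.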

In [7]:
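# Warning: this takes a long time to run.
df2 = df  # alias, not a copy: the new column is also added to df
# Note: this first assignment is overwritten by the next line,
# so ensure_unicode has no effect on the detected languages.
df2['Language'] = df2['comments'].apply(ensure_unicode)
df2['Language'] = df2['comments'].apply(run_detect)
df2.tail(100)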

Unnamed: 0,id,review_id,date,reviewer_id,reviewer_name,comments,listing_url,name,summary,space,...,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,reviews_per_month,Language
143137,1013866,20689829,2014-10-04,8520593,Diana,Helena's place is like a retreat of serenity. ...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143138,1013866,20985294,2014-10-09,11793114,Sigrid,Wir haben uns in Helena`s Haus sehr wohl gefüh...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,
143139,1013866,21145042,2014-10-12,15537929,Christopher,I thoroughly enjoyed our stay in San Francisco...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143140,1013866,21589015,2014-10-20,8639490,Ryan,Excellent stay! Gorgeous views of the east bay...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143141,1013866,22020625,2014-10-28,3169540,Max,My parents and their friends stayed here when ...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143142,1013866,22523990,2014-11-09,4510215,Nancy,"Clean, private, self efficient, airy and brigh...",https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143143,1013866,22817841,2014-11-15,11017183,David,Helena was easy to deal with - simple process ...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143144,1013866,23634450,2014-12-07,4598339,Sunny,Really beautiful place!!! I would definitely s...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143145,1013866,24052899,2014-12-18,35004,Ati,Helena's home was lovely. We had everything we...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en
143146,1013866,24453885,2014-12-28,9727230,Daniel,This was a home away from home in the city. Ho...,https://www.airbnb.com/rooms/1013866,Contemporary Get-Away w/ fab Views,,"This beautiful home is filled with light, high...",...,10,9,10,10,9,9,t,strict,3.41,en


- The reviews contain many languages, plus rows where detection failed (labeled `None` and attributed to mojibake), but most of the reviews (132,332 of 143,237) are English.
- Let's keep only the English reviews.
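
For readers unfamiliar with the term, mojibake is text whose bytes were decoded with the wrong character encoding. The short sketch below only illustrates the effect on a made-up sample string (not taken from the data set) and says nothing about what actually caused the `None` labels here. It is written for Python 3.

```python
# Made-up sample text containing a non-ASCII character
original = "Schönes Zimmer"

# Encode as UTF-8, then decode with the wrong codec (Latin-1)
utf8_bytes = original.encode("utf-8")
garbled = utf8_bytes.decode("latin-1")
print(garbled)  # the two UTF-8 bytes of 'ö' show up as two separate characters

# Reversing the wrong step recovers the text when no bytes were lost
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

Because UTF-8 stores `ö` as two bytes and Latin-1 maps every byte to one character, the garbled string is one character longer than the original.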

In [9]:
Counter(df2.Language)

Counter({None: 10104,
         u'af': 38,
         u'ca': 22,
         u'cs': 11,
         u'cy': 6,
         u'da': 7,
         u'de': 87,
         u'en': 132332,
         u'es': 134,
         u'fr': 107,
         u'hr': 4,
         u'hu': 8,
         u'id': 3,
         u'it': 61,
         u'nl': 135,
         u'no': 5,
         u'pl': 8,
         u'pt': 9,
         u'ro': 67,
         u'sl': 1,
         u'so': 22,
         u'sq': 1,
         u'sv': 1,
         u'sw': 6,
         u'tl': 50,
         u'tr': 3,
         u'vi': 5})
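
To put these counts in proportion, the sketch below recomputes each label's share of the total. The counts are copied from the output above; everything else (variable names, the choice to show the top five) is illustrative. It is written to run on Python 3.

```python
from collections import Counter

# Counts copied from the Counter output above
counts = Counter({
    None: 10104, 'af': 38, 'ca': 22, 'cs': 11, 'cy': 6, 'da': 7,
    'de': 87, 'en': 132332, 'es': 134, 'fr': 107, 'hr': 4, 'hu': 8,
    'id': 3, 'it': 61, 'nl': 135, 'no': 5, 'pl': 8, 'pt': 9,
    'ro': 67, 'sl': 1, 'so': 22, 'sq': 1, 'sv': 1, 'sw': 6,
    'tl': 50, 'tr': 3, 'vi': 5,
})

total = sum(counts.values())
print("total reviews: {}".format(total))

# Share of the five most frequent labels
for lang, n in counts.most_common(5):
    print("{}: {} ({:.2%})".format(lang, n, float(n) / total))

# Everything that is neither English nor undetected
other = total - counts['en'] - counts[None]
print("other languages: {} ({:.2%})".format(other, float(other) / total))
```

English and the undetected rows together account for all but 801 of the 143,237 reviews, which is why the individual non-English labels are not analyzed further.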

In [22]:
df3 = df[df.Language == 'en']
# drop() returns a new frame, so assign the result to actually remove the column
df3 = df3.drop(['Language'], axis=1)
df3

Unnamed: 0,id,review_id,date,reviewer_id,reviewer_name,comments,listing_url,name,summary,space,...,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,reviews_per_month
0,1994427,10612780,2014-02-27,4918890,Katie,Great host. Provided special gift basket upon ...,https://www.airbnb.com/rooms/1994427,One Bedroom apartment,Charming one bedroom in the Mission! Close to...,The bedroom fits two on the full sized bed and...,...,92,9,9,10,10,10,9,f,strict,0.24
1,1994427,49925683,2015-10-07,44613424,Steven,Great host. Met me at apartment. Place was cle...,https://www.airbnb.com/rooms/1994427,One Bedroom apartment,Charming one bedroom in the Mission! Close to...,The bedroom fits two on the full sized bed and...,...,92,9,9,10,10,10,9,f,strict,0.24
2,1994427,50537634,2015-10-12,7616696,Marsilius,Syeda hosted my parents for four days during t...,https://www.airbnb.com/rooms/1994427,One Bedroom apartment,Charming one bedroom in the Mission! Close to...,The bedroom fits two on the full sized bed and...,...,92,9,9,10,10,10,9,f,strict,0.24
3,1994427,51077613,2015-10-17,16703590,Robert,I had a great experience at Syeda's Airbnb! It...,https://www.airbnb.com/rooms/1994427,One Bedroom apartment,Charming one bedroom in the Mission! Close to...,The bedroom fits two on the full sized bed and...,...,92,9,9,10,10,10,9,f,strict,0.24
4,1994427,52109034,2015-10-26,6183409,Steve,"Great location! Warm, welcoming host. This w...",https://www.airbnb.com/rooms/1994427,One Bedroom apartment,Charming one bedroom in the Mission! Close to...,The bedroom fits two on the full sized bed and...,...,92,9,9,10,10,10,9,f,strict,0.24
5,1774461,8865666,2013-11-22,10034687,Alex,Lindsay and Agnes were great! Such a wonderfu...,https://www.airbnb.com/rooms/1774461,"Spacious Victorian, Great Location!",3 bedroom (1300 sq. ft) flat with beautiful Sa...,This place is unbelievably spacious compared t...,...,100,10,10,10,10,10,10,f,moderate,0.04
8,4511292,44548089,2015-08-27,38585635,Tara,We very much enjoyed our stay at Doug's home. ...,https://www.airbnb.com/rooms/4511292,Modern Home in Glen Park,Clean comfortable home in Glen Park. Close to...,,...,100,10,10,9,9,10,10,f,flexible,0.76
9,8613580,50042427,2015-10-08,4807228,Bibi,"Loved having our own SF apartment. Very clean,...",https://www.airbnb.com/rooms/8613580,Peaceful Apartment with Great Views,"Convenient to N-Judah, Golden Gate Park, Irvin...",,...,93,9,10,10,10,9,9,f,flexible,3.00
10,8613580,50386161,2015-10-11,44866256,Soyoung,Chelsea was great! Very communicative in coord...,https://www.airbnb.com/rooms/8613580,Peaceful Apartment with Great Views,"Convenient to N-Judah, Golden Gate Park, Irvin...",,...,93,9,10,10,10,9,9,f,flexible,3.00
11,8613580,50704332,2015-10-13,1670849,Levi,"Chelsea's apartment is a clean, comfortable, q...",https://www.airbnb.com/rooms/8613580,Peaceful Apartment with Great Views,"Convenient to N-Judah, Golden Gate Park, Irvin...",,...,93,9,10,10,10,9,9,f,flexible,3.00


## Export to CSV

In [77]:
# df3.to_csv('data/df_new_en.csv')

In [80]:
len(df3)

132332

In [79]:
len(df)

143237