The task at hand is to predict Airbnb prices in NYC from listing attributes such as location, room type and availability.

**NOTE**: This project is done solely for personal practice. Please let me know where I can improve or if you have any questions about my reasoning. Thank you!

In [1]:
#from pandas import read_csv
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from pandas.plotting import scatter_matrix
from matplotlib import pyplot

<h3> Obtaining Data

In [2]:
df = pd.read_csv('AB_NYC_2019.csv')
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


In [3]:
columns = df.columns
columns, len(columns)

(Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
        'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
        'minimum_nights', 'number_of_reviews', 'last_review',
        'reviews_per_month', 'calculated_host_listings_count',
        'availability_365'],
       dtype='object'),
 16)

<h3> Exploring Data

In [4]:
df.shape

(48895, 16)

We have 16 columns (15 candidate features plus the target, price) and ~49000 samples. Not all of the features are useful and we are unsure whether all samples are valid (e.g. they may contain NaN), so we will most likely have to clean the data and perform feature engineering to obtain more useful data. Let's see if we can get any insights from the data before making any changes to it. <br>

Note that the majority of the features are self-explanatory, but just for clarification:
1. **'calculated_host_listings_count'** indicates the total number of listings (apartments and rooms) that belong to the same host (one host can have more than one property in NY) [1]
2. **'availability_365'** indicates the number of days in a year that the listing is available. It is possible for this value to be 0 [2]

[1]https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/discussion/120300 <br>
[2]https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data/discussion/111835

In [5]:
# checking data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

In [6]:
# counting the NaN values in each column
if df.isnull().values.any():
    for name in columns:
        print('%s: %i' % (name, df[name].isnull().sum()))

id: 0
name: 16
host_id: 0
host_name: 21
neighbourhood_group: 0
neighbourhood: 0
latitude: 0
longitude: 0
room_type: 0
price: 0
minimum_nights: 0
number_of_reviews: 0
last_review: 10052
reviews_per_month: 10052
calculated_host_listings_count: 0
availability_365: 0


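The majority of the NaN values are in the last_review and reviews_per_month features. Both columns have exactly the same NaN count (10052), and the one NaN row visible in the preview above (row 2) has number_of_reviews = 0. This suggests that these listings simply have no reviews yet: there is no last review date to record and no monthly rate to compute, so the NaN values are probably not a data gathering error.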
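A quick way to probe where these NaN values come from is to compare the NaN pattern against number_of_reviews. The sketch below uses a small made-up frame (`toy`, with invented values) instead of the real data, so it only illustrates the check; the same two comparisons could be run on `df`.

```python
import numpy as np
import pandas as pd

# hypothetical miniature of the listings table (values are made up)
toy = pd.DataFrame({
    'number_of_reviews': [9, 0, 270, 0],
    'last_review': ['2018-10-19', np.nan, '2019-07-05', np.nan],
    'reviews_per_month': [0.21, np.nan, 4.64, np.nan],
})

no_reviews = toy['number_of_reviews'] == 0

# True when the NaN pattern matches "listing has zero reviews" exactly
last_review_matches = (toy['last_review'].isnull() == no_reviews).all()
per_month_matches = (toy['reviews_per_month'].isnull() == no_reviews).all()
print(last_review_matches, per_month_matches)
```

If both checks come back True on the real data, filling reviews_per_month with 0 is a natural choice, since a listing without reviews has a review rate of 0.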

As of now, we can drop a couple of columns that (in my honest opinion) are not that important:
1. id
2. host_id
3. host_name
4. last_review



In [7]:
df.drop(columns = ['id', 'host_id', 'host_name', 'last_review'], inplace = True)
df.head()

Unnamed: 0,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,1,365
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
4,Entire Apt: Spacious Studio/Loft by central park,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,0.1,1,0


We left some columns over that might potentially be useful. If not, we will simply drop them later. First, let's deal with the NaN and potentially invalid data.

As mentioned earlier, availability_365 can equal 0, meaning that the listing is not available on any day of the year. It is unknown whether the host has simply closed the listing or whether it's bad data, but we will remove samples where availability_365 = 0. There are no NaN values in this column so we won't need to handle that case.

In [8]:
df_1 = df[df['availability_365'] > 0]
df_1

Unnamed: 0,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,1,365
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
5,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,40.74767,-73.97500,Entire home/apt,200,3,74,0.59,1,129
...,...,...,...,...,...,...,...,...,...,...,...,...
48890,Charming one bedroom - newly renovated rowhouse,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,2,9
48891,Affordable room in Bushwick/East Williamsburg,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,2,36
48892,Sunny Studio at Historical Neighborhood,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,1,27
48893,43rd St. Time Square-cozy single bed,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,6,2


**NOTE**: ~17500 samples were removed (from 48895 to 31362 rows)

Now changing the NaN values in the reviews_per_month column into 0

In [9]:
# assign returns a new DataFrame, which avoids writing into a slice of df
df_1 = df_1.assign(reviews_per_month = df_1['reviews_per_month'].fillna(0))
df_1


Unnamed: 0,name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,0.00,1,365
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,4.64,1,194
5,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,40.74767,-73.97500,Entire home/apt,200,3,74,0.59,1,129
...,...,...,...,...,...,...,...,...,...,...,...,...
48890,Charming one bedroom - newly renovated rowhouse,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,0.00,2,9
48891,Affordable room in Bushwick/East Williamsburg,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,0.00,2,36
48892,Sunny Studio at Historical Neighborhood,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,0.00,1,27
48893,43rd St. Time Square-cozy single bed,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,0.00,6,2


We had 4 NaN values in the name column as well. We will fill these with the stopword 'to', which is stripped out later when stop words are removed, so these titles effectively end up blank.

In [10]:
df = df_1.copy()
df['name'] = df_1['name'].fillna('to') # we are using 'to' bc it is a stopword -> explained later

In [11]:
# checking our NaN values again
df.isnull().values.any() 

False

As of this moment, I have no idea how to incorporate latitude and longitude into my model. I think it's a fair assumption that latitude and longitude are not metrics people normally use when considering prices. In addition, the information in these two columns is largely captured by the neighbourhood_group and neighbourhood columns.

In [12]:
df.drop(columns = ['latitude', 'longitude'], inplace = True)
df.head()

Unnamed: 0,name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,Brooklyn,Kensington,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,Manhattan,Midtown,Entire home/apt,225,1,45,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,Private room,150,3,0,0.0,1,365
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,Entire home/apt,89,1,270,4.64,1,194
5,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,Entire home/apt,200,3,74,0.59,1,129


<h3> Encoding + Feature Engineering 

We will not be able to use string types in our model, so we will have to encode room_type, neighbourhood_group and neighbourhood. 

In addition, the title of an Airbnb listing is (once again in my opinion) an important factor that people consider when browsing listings (no one wants to rent an airbnb called craphole). So we will perform natural language processing on this column, using sentiment analysis to obtain some kind of score. Note that there are several flaws with this plan, such as the fact that there is no dictionary available that maps terms specific to real estate and property to a value. Also, given how short the strings in the name column are, there might be large variance in sentiment values.



<h4> Encoding

In [13]:
df['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

In [14]:
df['neighbourhood_group'].unique()

array(['Brooklyn', 'Manhattan', 'Queens', 'Staten Island', 'Bronx'],
      dtype=object)

In [15]:
df['neighbourhood'].unique()

array(['Kensington', 'Midtown', 'Harlem', 'Clinton Hill', 'Murray Hill',
       "Hell's Kitchen", 'Chinatown', 'Upper West Side', 'South Slope',
       'Williamsburg', 'Fort Greene', 'Chelsea', 'Crown Heights',
       'East Harlem', 'Park Slope', 'Bedford-Stuyvesant',
       'Windsor Terrace', 'Inwood', 'East Village', 'Greenpoint',
       'Bushwick', 'Flatbush', 'Lower East Side',
       'Prospect-Lefferts Gardens', 'Long Island City', 'Kips Bay',
       'SoHo', 'Upper East Side', 'Prospect Heights',
       'Washington Heights', 'Woodside', 'Brooklyn Heights',
       'Carroll Gardens', 'West Village', 'Gowanus', 'Flatlands',
       'Flushing', 'Boerum Hill', 'Sunnyside', 'DUMBO', 'St. George',
       'Highbridge', 'Ridgewood', 'Morningside Heights', 'Jamaica',
       'Middle Village', 'NoHo', 'Ditmars Steinway', 'Cobble Hill',
       'Flatiron District', 'Roosevelt Island', 'Greenwich Village',
       'East Flatbush', 'Tompkinsville', 'Astoria', 'Clason Point',
       'Eastchester', '

There are no duplicated or invalid values (at a glance) so we will not have to perform any cleaning

In [16]:
# taking columns with an object type only, and dropping the name column
# bc we do not want to perform encoding on it
# (drop without inplace returns a new frame, so nothing is written into a slice of df)
object_cols = df.select_dtypes(include = [object]).drop(columns = ['name'])
object_cols.head()


Unnamed: 0,neighbourhood_group,neighbourhood,room_type
0,Brooklyn,Kensington,Private room
1,Manhattan,Midtown,Entire home/apt
2,Manhattan,Harlem,Private room
3,Brooklyn,Clinton Hill,Entire home/apt
5,Manhattan,Murray Hill,Entire home/apt


In [17]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
object_cols = encoder.fit_transform(object_cols)
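To make the encoding step concrete, here is a minimal sketch on a made-up three-row frame (`toy_cats` and `toy_encoder` are hypothetical names, not part of the notebook). Each distinct category becomes its own 0/1 column, and the result is a scipy sparse matrix.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# hypothetical toy frame with two categorical columns
toy_cats = pd.DataFrame({
    'neighbourhood_group': ['Brooklyn', 'Manhattan', 'Manhattan'],
    'room_type': ['Private room', 'Entire home/apt', 'Private room'],
})

toy_encoder = OneHotEncoder()
toy_encoded = toy_encoder.fit_transform(toy_cats)  # scipy sparse matrix

# one output column per distinct category: 2 boroughs + 2 room types
print(toy_encoded.shape)
print(toy_encoder.categories_)
print(toy_encoded.toarray())
```

The same logic is consistent with the 226 columns reported for the real matrix later in the notebook: 5 neighbourhood groups, 218 neighbourhoods and 3 room types.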

<h4> Sentiment Analysis

Now we will perform sentiment analysis on the name column. Remember that we replaced the NaN values in the name column with 'to'. VADER will be our chosen library for sentiment analysis. It is capable of picking up on capitalization, slang and emojis to some extent, making it more accurate for informal writing such as this.

In [18]:
# creating a function to remove stop words to decrease our runtime
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# building the stop word set once instead of on every call
english_stop_words = set(stopwords.words('english'))

def stop_words(text):
    # the comparison is case-sensitive, so tokens like 'THE' or 'In' are kept
    word_tokens = word_tokenize(text)
    filtered_sentence = [w for w in word_tokens if w not in english_stop_words]
    return filtered_sentence

Now let's go through our name column one last time. Since our sentiment analysis will only be applied to English, we should check whether all the titles are in English. We will be using langdetect for this.

In [19]:
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

description = df['name'].astype(str)

errors = []        # titles that langdetect could not process at all
not_english = []   # titles detected as a language other than English
english = 0
other = 0
for title in description:
    try:
        if detect(title) == 'en':
            english += 1
        else:
            not_english.append(title)
            other += 1
    except LangDetectException:
        other += 1
        errors.append(title)
english, other

(26299, 5063)

Not all of the titles get detected as English even though they are in English, which might create problems for our sentiment analysis. Two likely causes are that the titles are very short, which gives langdetect little text to work with, and that it has to choose among many candidate languages. On top of that, people use slang, emojis and mix languages together, making it difficult to pinpoint the exact language.

In [20]:
not_english

['THE VILLAGE OF HARLEM....NEW YORK !',
 'CBG Helps Haiti Room#2.5',
 'SPACIOUS, LOVELY FURNISHED MANHATTAN BEDROOM',
 'Modern 1 BR / NYC / EAST VILLAGE',
 'back room/bunk beds',
 'Large B&B Style rooms',
 'West Side Retreat',
 'BEST BET IN HARLEM',
 '* ORIGINAL BROOKLYN LOFT *',
 'Williamsburg 1 bedroom Apartment',
 'Room in Greenpoint Loft w/ Roof',
 '2 bedroom - Upper East Side-great for kids',
 'BLUE TRIM GUEST HOUSE',
 'Manhattan Room',
 'Entire 2 Bedroom - Large & Sunny',
 '2 bedroom Williamsburg Apt - Bedford L stop',
 'Cozy Bedroom in Williamsburg 3 BR',
 'COZY QUIET room 4 DOOGLERS!',
 'House On Henry (3rd FLR Suite)',
 'Beautiful Duplex Apartment',
 'Beautiful Apartment East Village',
 'Sugar Hill Rest Stop ',
 'ACCOMMODATIONS GALORE #1',
 'Unique & Charming small 1br Apt. LES',
 'Luminous Beautiful West Village Studio',
 'Modern Greenpoint, Brooklyn Apt',
 'FLAT MACDONOUGH GARDEN',
 'NYC Zen',
 'Cozy BR in Wiliamsburg 3 Bedroom',
 'West Inn 2 - East Village',
 'BROOKLYN VICT

A way around this is to limit the number of possible languages (using langid). For entries that are not in English, we can blank those titles and give them a sentiment score equal to the average of their neighborhood. According to World Atlas, the most common languages spoken in NYC are English, Spanish/Spanish Creole and Chinese [3]. To get more accurate results, we should also remove characters that are not part of ordinary English words (i.e. hyphens, slashes, asterisks etc.)

[3] https://www.worldatlas.com/articles/how-many-languages-are-spoken-in-nyc.html

In [21]:
def grammar(string):
    # add whatever else you think you might have to account for
    symbols = ['/', '*', '&', '>', '<', '-', '...', '@', '#', '$', '%', '+', '=']
    result = str(string)
    # replacing every listed symbol with a single space
    for symbol in symbols:
        result = result.replace(symbol, ' ')
    return result

In [22]:
import langid
desc = description.apply(grammar)
langid.set_languages(['en', 'es', 'zh'])
not_en = []
not_en_index = []
# items() yields (index label, title) pairs, so no manual counter is needed
for idx, title in desc.items():
    if langid.classify(title)[0] != 'en':
        not_en.append(title)
        not_en_index.append(idx)

len(not_en)

1745

While a couple of the names that were categorized as not English are still in English, the number of flagged titles dropped from 5063 to 1745, so far fewer English titles are now misclassified.

In [23]:
not_en

['Spacious 1 bedroom in luxe building',
 'West Side Retreat',
 'Sunny   Spacious Chelsea Apartment',
 'Spacious luminous apt Upper West NYC',
 'Spacious Prospect Heights Apartment',
 'Sun drenched, artsy modernist 1 BDRM duplex',
 'Bright Spacious Luxury Condo',
 'Designer 1 BR Duplex w  Terrace  Spectacular Views',
 'Spacious Williamsburg Share w  LOFT BED',
 'Spacious 1BR, Adorable Clean Quiet',
 "DOMINIQUE'S NY mini efficiency  wifi metro quiet",
 'Bright Spacious Williamsburg Abode!',
 'Lovely, Modern, Garden Apartment',
 'Spacious Loft in Clinton Hill',
 'UES Quiet   Spacious 1 bdrm for 4',
 'Spacious Brooklyn Loft   2 Bedroom',
 'Park Slope Apt:, Spacious 2 bedroom',
 '☆Massive DUPLEX☆ 2BR   2BTH East Village 9  Guests',
 'Luxe, Spacious 2BR 2BA Nr Trains',
 'Private, spacious room in Brooklyn',
 'Greenpoint Spacious Loft',
 'Large  Loft Style  Studio  Space',
 'Cool   Spacious Harlem Artist Flat',
 '☆ STUDIO East Village ☆ Own bath! ☆ Sleeps 4 ☆',
 'Sunny, quiet, legal homelike 

Now we will turn these titles into empty strings so that when the sentiment analysis is performed, they get a score of 0.

In [24]:
# blanking all flagged titles in one step, using the index labels collected above
desc.loc[not_en_index] = ''

Now removing stop words

In [25]:
description = desc.apply(stop_words)
description 

0                          [Clean, quiet, apt, home, park]
1                                [Skylit, Midtown, Castle]
2                [THE, VILLAGE, OF, HARLEM, .NEW, YORK, !]
3                        [Cozy, Entire, Floor, Brownstone]
5        [Large, Cozy, 1, BR, Apartment, In, Midtown, E...
                               ...                        
48890    [Charming, one, bedroom, newly, renovated, row...
48891     [Affordable, room, Bushwick, East, Williamsburg]
48892            [Sunny, Studio, Historical, Neighborhood]
48893         [43rd, St., Time, Square, cozy, single, bed]
48894           [Trendy, duplex, heart, Hell, 's, Kitchen]
Name: name, Length: 31362, dtype: object

To use VADER, we have to join each list of tokens back into a single string

In [26]:
# creating a function to convert a list of strings to single string
def to_single_string(list_of_strings):
    # join avoids the leading space that repeated concatenation would add
    return ' '.join(list_of_strings)

# applying above function
description = description.apply(to_single_string)
description

0                             Clean quiet apt home park
1                                 Skylit Midtown Castle
2                     THE VILLAGE OF HARLEM .NEW YORK !
3                          Cozy Entire Floor Brownstone
5             Large Cozy 1 BR Apartment In Midtown East
                              ...                      
48890     Charming one bedroom newly renovated rowhouse
48891        Affordable room Bushwick East Williamsburg
48892              Sunny Studio Historical Neighborhood
48893              43rd St. Time Square cozy single bed
48894               Trendy duplex heart Hell 's Kitchen
Name: name, Length: 31362, dtype: object

In [27]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

sentiment_analyzer = SentimentIntensityAnalyzer()

def sentiment_score(string):
    result = sentiment_analyzer.polarity_scores(string)
    return result

In [28]:
sentiment = description.apply(sentiment_score)
sentiment

0        {'neg': 0.0, 'neu': 0.597, 'pos': 0.403, 'comp...
1        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
2        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
3        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
5        {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
                               ...                        
48890    {'neg': 0.0, 'neu': 0.568, 'pos': 0.432, 'comp...
48891    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
48892    {'neg': 0.0, 'neu': 0.517, 'pos': 0.483, 'comp...
48893    {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...
48894    {'neg': 0.359, 'neu': 0.312, 'pos': 0.328, 'co...
Name: name, Length: 31362, dtype: object

VADER reports 4 metrics for a piece of text. Positive, negative, and neutral represent the proportion of the text falling into these categories. The last metric, compound, is the sum of all the lexicon ratings, normalized to lie between -1 (most negative) and +1 (most positive). A general rule is:
1. *positive sentiment*: compound score >= 0.05
2. *neutral sentiment*: -0.05 < compound score < 0.05
3. *negative sentiment*: compound score <= -0.05
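For intuition about how an unbounded sum of ratings can end up between -1 and +1: to my knowledge, the VADER reference implementation squashes the raw sum x with x / sqrt(x^2 + alpha), using alpha = 15. The sketch below is a standalone re-creation for illustration only (the names `squash` and `label` are mine, not VADER's API), combined with the thresholds from the list above.

```python
import math

def squash(raw_sum, alpha=15):
    # maps any real number into the open interval (-1, 1)
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

def label(compound):
    # thresholds from the rule of thumb above
    if compound >= 0.05:
        return 'positive'
    if compound <= -0.05:
        return 'negative'
    return 'neutral'

for raw in [-4.0, 0.0, 0.1, 4.0]:
    c = squash(raw)
    print(raw, round(c, 4), label(c))
```

Because the titles are only a few words long, a single rated word can move the raw sum a lot, which is one reason the compound scores here cluster at 0 or jump to fairly large values.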

Simply to experiment, I will try two different sentiment scoring methods:
1. sentiment value = compound score
2. sentiment value = {-1, 0, 1} depending on polarity of the sentiment from the ranges above



In [29]:
# method 1
def compound_score(sent):
    return sent.get('compound')

sentiment_M1 = sentiment.apply(compound_score)
sentiment_M1

0        0.4019
1        0.0000
2        0.0000
3        0.0000
5        0.0000
          ...  
48890    0.5859
48891    0.0000
48892    0.4215
48893    0.0000
48894   -0.1027
Name: name, Length: 31362, dtype: float64

In [30]:
# method 2
def polarity(sent):
    compound = sent.get('compound')
    if(compound >= 0.05):
        return 1
    elif(compound <= -0.05):
        return -1
    return 0

sentiment_M2 = sentiment.apply(polarity)
sentiment_M2

0        1
1        0
2        0
3        0
5        0
        ..
48890    1
48891    0
48892    1
48893    0
48894   -1
Name: name, Length: 31362, dtype: int64

Now that we have obtained the sentiment scores for the English titles, we have to compute the average in each neighborhood and apply that average as the sentiment value for the non-English titles.

In [31]:
df.columns

Index(['name', 'neighbourhood_group', 'neighbourhood', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'reviews_per_month',
       'calculated_host_listings_count', 'availability_365'],
      dtype='object')

In [32]:
# creating temporary df (the Series share df's index, so they align row by row)
temporary = pd.DataFrame({
    'location': df['neighbourhood'],
    'sent_M1': sentiment_M1,
    'sent_M2': sentiment_M2,
    'name': description,
})
temporary

Unnamed: 0,location,sent_M1,sent_M2,name
0,Kensington,0.4019,1,Clean quiet apt home park
1,Midtown,0.0000,0,Skylit Midtown Castle
2,Harlem,0.0000,0,THE VILLAGE OF HARLEM .NEW YORK !
3,Clinton Hill,0.0000,0,Cozy Entire Floor Brownstone
5,Murray Hill,0.0000,0,Large Cozy 1 BR Apartment In Midtown East
...,...,...,...,...
48890,Bedford-Stuyvesant,0.5859,1,Charming one bedroom newly renovated rowhouse
48891,Bushwick,0.0000,0,Affordable room Bushwick East Williamsburg
48892,Harlem,0.4215,1,Sunny Studio Historical Neighborhood
48893,Hell's Kitchen,0.0000,0,43rd St. Time Square cozy single bed


In [33]:
# removing rows that are in not_en, with not_en_index (which was obtained in the same block of code as not_en)
temporary.drop(index = not_en_index, inplace = True)
temporary

Unnamed: 0,location,sent_M1,sent_M2,name
0,Kensington,0.4019,1,Clean quiet apt home park
1,Midtown,0.0000,0,Skylit Midtown Castle
2,Harlem,0.0000,0,THE VILLAGE OF HARLEM .NEW YORK !
3,Clinton Hill,0.0000,0,Cozy Entire Floor Brownstone
5,Murray Hill,0.0000,0,Large Cozy 1 BR Apartment In Midtown East
...,...,...,...,...
48890,Bedford-Stuyvesant,0.5859,1,Charming one bedroom newly renovated rowhouse
48891,Bushwick,0.0000,0,Affordable room Bushwick East Williamsburg
48892,Harlem,0.4215,1,Sunny Studio Historical Neighborhood
48893,Hell's Kitchen,0.0000,0,43rd St. Time Square cozy single bed


In [34]:
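# selecting the numeric columns explicitly, since the name column cannot be averaged
neighborhood_sent = temporary.groupby('location')[['sent_M1', 'sent_M2']].mean()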
neighborhood_sent

Unnamed: 0_level_0,sent_M1,sent_M2
location,Unnamed: 1_level_1,Unnamed: 2_level_1
Allerton,0.272953,0.500000
Arden Heights,0.048133,0.000000
Arrochar,0.123245,0.250000
Arverne,0.211664,0.375000
Astoria,0.259594,0.439394
...,...,...
Willowbrook,0.784500,1.000000
Windsor Terrace,0.361880,0.674157
Woodhaven,0.272691,0.507246
Woodlawn,-0.023111,0.000000


In [35]:
def polarity_range(score):
    if(score >= 0.05):
        return 1
    elif(score <= -0.05):
        return -1
    return 0


# obtaining our different sentiment scores
nhood_sent_M1 = neighborhood_sent['sent_M1']
nhood_sent_M2 = neighborhood_sent['sent_M2']

# applying our function to sent_M2 to turn it into -1, 0 and +1 only
nhood_sent_M2 = nhood_sent_M2.apply(polarity_range)

In [36]:
nhood_sent_M1

location
Allerton           0.272953
Arden Heights      0.048133
Arrochar           0.123245
Arverne            0.211664
Astoria            0.259594
                     ...   
Willowbrook        0.784500
Windsor Terrace    0.361880
Woodhaven          0.272691
Woodlawn          -0.023111
Woodside           0.204477
Name: sent_M1, Length: 218, dtype: float64

Now let's combine everything we have by substituting these averages back into our original Series of sentiment scores

In [37]:
sent_m1 = sentiment_M1.copy()
sent_m2 = sentiment_M2.copy()

for index in not_en_index:
    sent_m1[index] = nhood_sent_M1[df['neighbourhood'][index]]
    sent_m2[index] = nhood_sent_M2[df['neighbourhood'][index]]
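The same fill-in can also be written without an explicit loop. The sketch below shows the idea on made-up data (all `toy_*` names and values are hypothetical): compute neighbourhood means from the English titles only, then map each flagged listing's neighbourhood to its mean.

```python
import pandas as pd

# hypothetical toy data: four listings in two neighbourhoods
toy_nhood = pd.Series(['Harlem', 'Harlem', 'Astoria', 'Astoria'], index=[0, 1, 2, 5])
toy_score = pd.Series([0.4, 0.0, 0.2, 0.0], index=[0, 1, 2, 5])
toy_not_en_index = [1]  # pretend listing 1 was flagged as non-English

# neighbourhood means computed from the English titles only
english_scores = toy_score.drop(index=toy_not_en_index)
english_nhoods = toy_nhood.drop(index=toy_not_en_index)
toy_means = english_scores.groupby(english_nhoods).mean()

# replace the flagged scores with their neighbourhood's mean in one step
filled = toy_score.copy()
filled.loc[toy_not_en_index] = toy_nhood.loc[toy_not_en_index].map(toy_means)
print(filled)
```

One difference worth knowing: if a neighbourhood had no English titles at all, `map` would leave NaN for its listings, whereas the loop version would raise a KeyError.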

Now that we have all the required features, we can put our final dataset together and begin creating our models


<h3> Preparing Data

Let's take another look at our dataframe

In [38]:
df.head()

Unnamed: 0,name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,number_of_reviews,reviews_per_month,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,Brooklyn,Kensington,Private room,149,1,9,0.21,6,365
1,Skylit Midtown Castle,Manhattan,Midtown,Entire home/apt,225,1,45,0.38,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,Private room,150,3,0,0.0,1,365
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,Entire home/apt,89,1,270,4.64,1,194
5,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,Entire home/apt,200,3,74,0.59,1,129


number_of_reviews and reviews_per_month do not help us much since we don't know whether the reviews were positive or negative. Because they are so ambiguous, we can drop these 2 columns

In [39]:
df.drop(columns = ['number_of_reviews', 'reviews_per_month'], inplace = True)
df.head()

Unnamed: 0,name,neighbourhood_group,neighbourhood,room_type,price,minimum_nights,calculated_host_listings_count,availability_365
0,Clean & quiet apt home by the park,Brooklyn,Kensington,Private room,149,1,6,365
1,Skylit Midtown Castle,Manhattan,Midtown,Entire home/apt,225,1,2,355
2,THE VILLAGE OF HARLEM....NEW YORK !,Manhattan,Harlem,Private room,150,3,1,365
3,Cozy Entire Floor of Brownstone,Brooklyn,Clinton Hill,Entire home/apt,89,1,1,194
5,Large Cozy 1 BR Apartment In Midtown East,Manhattan,Murray Hill,Entire home/apt,200,3,1,129


Dropping all of the features that are not used or have already been transformed.

In [40]:
df.drop(columns = ['name', 'neighbourhood_group', 'neighbourhood', 'room_type'], inplace = True)
df.head()

Unnamed: 0,price,minimum_nights,calculated_host_listings_count,availability_365
0,149,1,6,365
1,225,1,2,355
2,150,3,1,365
3,89,1,1,194
5,200,3,1,129


Extracting our target label

In [41]:
# creating a copy of df for later
df_copy = df.copy()
#################################
y = df['price']
df.drop(columns = ['price'], inplace = True)

Now adding our features from before. Two dataframes are created to account for both methods used to calculate sentiment

In [42]:
df_sent_m1 = df.copy()
df_sent_m2 = df.copy()

df_sent_m1['sentiment'] = sent_m1
df_sent_m2['sentiment'] = sent_m2


# adding onto our copy of df
df_copy['sent_m1'] = sent_m1
df_copy['sent_m2'] = sent_m2

Let's take a quick look at the correlation coefficient that price has with every other feature

In [43]:
corr_matrix_m1 = df_copy.corr()
corr_matrix_m1["price"].sort_values(ascending = False)

price                             1.000000
availability_365                  0.074509
calculated_host_listings_count    0.060828
minimum_nights                    0.039449
sent_m1                          -0.024784
sent_m2                          -0.027488
Name: price, dtype: float64

So the correlation coefficients between price and our other features are very low (even reaching negative values for the sentiment scores). This can indicate one of 2 things:
1. there is little to no relationship between price and the other features
2. the relationship between price and the other features is nonlinear, since the correlation coefficient only measures linear relationships
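The second possibility is easy to demonstrate with synthetic numbers. In the sketch below (made-up data, not the Airbnb set), `y_curved` is completely determined by `x`, yet its Pearson correlation with `x` is essentially zero because the dependence is not a straight line.

```python
import numpy as np

# hypothetical feature, symmetric around zero
x = np.linspace(-1, 1, 201)

y_linear = 3 * x + 1   # straight-line dependence
y_curved = x ** 2      # fully determined by x, but not linearly

corr_linear = np.corrcoef(x, y_linear)[0, 1]
corr_curved = np.corrcoef(x, y_curved)[0, 1]
print(round(corr_linear, 3), round(corr_curved, 3))
```

So a low coefficient on its own does not rule out that a nonlinear model such as a random forest could still pick up signal from these features.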

In [63]:
object_cols

<31362x226 sparse matrix of type '<class 'numpy.float64'>'
	with 94086 stored elements in Compressed Sparse Row format>

In [44]:
# we will change df_sent_m1 and _m2 into sparse matrices so they can be stacked with the one-hot encoded columns
from scipy import sparse

X_m1 = sparse.hstack([object_cols, sparse.csr_matrix(df_sent_m1.to_numpy())])
X_m2 = sparse.hstack([object_cols, sparse.csr_matrix(df_sent_m2.to_numpy())])

<h4> Train/Test

Now splitting the data into train and test sets

In [45]:
from sklearn.model_selection import train_test_split
X_train_m1, X_test_m1, y_train_m1, y_test_m1 = train_test_split(X_m1, y, test_size = 0.2, random_state = 42)
X_train_m2, X_test_m2, y_train_m2, y_test_m2 = train_test_split(X_m2, y, test_size = 0.2, random_state = 42)

y_train_m1 = y_train_m1.tolist()
y_train_m2 = y_train_m2.tolist()
y_test_m1 = y_test_m1.tolist()
y_test_m2 = y_test_m2.tolist()

To obtain more accurate results, standardization (zero mean, unit variance) will be applied to the appropriate columns

In [46]:
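# the last 4 columns are minimum_nights, calculated_host_listings_count,
# availability_365 and sentiment; method 2 leaves its sentiment column
# (already -1, 0 or 1) unscaled, so only 3 columns are taken there
n_cols = X_train_m1.shape[1]
temp_train_m1 = X_train_m1[:, n_cols-4:].toarray()
temp_train_m2 = X_train_m2[:, n_cols-4: n_cols-1].toarray()

temp_test_m1 = X_test_m1[:, n_cols-4:].toarray()
temp_test_m2 = X_test_m2[:, n_cols-4: n_cols-1].toarray()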

from sklearn.preprocessing import StandardScaler
scaler_m1 = StandardScaler().fit(temp_train_m1)
scaler_m2 = StandardScaler().fit(temp_train_m2)

temp_train_m1 = scaler_m1.transform(temp_train_m1)
temp_train_m2 = scaler_m2.transform(temp_train_m2)

temp_test_m1 = scaler_m1.transform(temp_test_m1)
temp_test_m2 = scaler_m2.transform(temp_test_m2)

In [47]:
X_train_m1[:,X_train_m1.shape[1]-4:] = sparse.csr_matrix(temp_train_m1)
X_train_m2[:,X_train_m2.shape[1]-4: X_train_m2.shape[1]-1] = sparse.csr_matrix(temp_train_m2)

X_test_m1[:,X_test_m1.shape[1]-4: ] = sparse.csr_matrix(temp_test_m1)
X_test_m2[:,X_test_m2.shape[1]-4: X_test_m2.shape[1]-1] = sparse.csr_matrix(temp_test_m2)

  self._set_arrayXarray_sparse(i, j, x)
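
The line above appears to be the tail of scipy's `SparseEfficiencyWarning`, which is raised when values are assigned into a CSR matrix. One way to avoid it, sketched below under assumptions, is to rebuild each matrix with `hstack` instead of assigning into it. The helper `scale_trailing_columns` is hypothetical (it is not part of this notebook), and it scales all of the last `n_cols` columns, whereas the m2 cells above leave the final column unscaled.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

def scale_trailing_columns(X_train, X_test, n_cols):
    """Standardize the last n_cols columns; leave the leading columns untouched."""
    X_train, X_test = sparse.csr_matrix(X_train), sparse.csr_matrix(X_test)
    split = X_train.shape[1] - n_cols
    # Fit on the training rows only so no test-set statistics leak into the scaler.
    scaler = StandardScaler().fit(X_train[:, split:].toarray())
    parts = []
    for X in (X_train, X_test):
        scaled = sparse.csr_matrix(scaler.transform(X[:, split:].toarray()))
        # Rebuild the matrix rather than assigning into it.
        parts.append(sparse.hstack([X[:, :split], scaled], format="csr"))
    return parts[0], parts[1]

# Toy data: two one-hot columns followed by two numeric columns.
train = np.array([[1, 0, 1.0, 10.0], [0, 1, 3.0, 30.0], [1, 0, 5.0, 50.0]])
test = np.array([[0, 1, 3.0, 20.0]])
train_scaled, test_scaled = scale_trailing_columns(train, test, n_cols=2)
print(train_scaled.toarray())
print(test_scaled.toarray())
```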


Now that we have prepared our train and test data, we can begin to create models.

<h3> Model Building

The models that we will try are:
1. SVM
2. Lasso Regression
3. Ridge Regression
4. Linear Regression
5. Random Forests
6. Neural network

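Cross-validation with k-fold splitting will be used to determine the most appropriate model. A plain `KFold` is used rather than a stratified one, because stratification requires discrete class labels and price is a continuous target.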
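
Because price is a continuous target, the splitter defined in the next cell is a plain `KFold`. The sketch below shows the same pattern on synthetic regression data; `X_demo`, `y_demo` and the 5-fold setting are assumptions made for illustration and are not the notebook's data or settings.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration only (not the Airbnb features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# KFold simply partitions the rows; StratifiedKFold would need discrete class labels.
cv = KFold(n_splits=5, shuffle=True, random_state=1)

# scikit-learn maximizes scores, so MAE is reported negated; flip the sign back.
mae_per_fold = -cross_val_score(Ridge(alpha=0.9), X_demo, y_demo, cv=cv,
                                scoring="neg_mean_absolute_error")
print("MAE: %f (%f)" % (mae_per_fold.mean(), mae_per_fold.std()))
```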

In [82]:
from sklearn.svm import SVR
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from keras.models import Sequential
from keras.layers import Dense
from keras import regularizers
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import KFold
from keras.initializers import RandomNormal

#### FOR NEURAL NETWORK ####
initializer = RandomNormal(mean=0., stddev=1.)

def neural_net():
    model = Sequential()
    model.add(Dense(int(2/3 * X_test_m1.shape[1]), kernel_initializer = initializer, activation = 'relu', input_dim = X_test_m1.shape[1]))
    model.add(Dense(int(4/9 * X_test_m1.shape[1]), kernel_initializer = initializer, activation = 'relu'))
    model.add(Dense(1, kernel_initializer = 'normal', activation = 'relu'))
    model.compile(optimizer = 'SGD', loss = 'mse', metrics = ['mae'])
    return model
############################    

models = []
models += [['SVM', SVR(kernel = 'linear')]]
models += [['Lasso', Lasso(alpha = 0.9, selection = 'cyclic')]]
models += [['Ridge', Ridge(alpha = 0.9, solver = 'auto')]]
models += [['Linear', LinearRegression()]]
# NOTE: RandomForestClassifier treats every distinct price as a class label;
# RandomForestRegressor is the estimator intended for a continuous target such as price
models += [['Random Forests', RandomForestClassifier(n_estimators = 100, max_features = X_test_m1.shape[1], random_state = 42, max_depth = 9)]]

# for the k fold cross validation
kfold = KFold(n_splits = 10, random_state = 1, shuffle = True)

Performing cross validation

<h4> Method 1

In [83]:
from sklearn.model_selection import cross_val_score

result_m1 =[]
names = []

for name, model in models:
    cv_score = -1 * cross_val_score(model, X_train_m1, y_train_m1, cv = kfold, scoring = 'neg_mean_absolute_error')
    result_m1 +=[cv_score]
    names += [name]
    print('%s: %f (%f)' % (name,cv_score.mean(), cv_score.std()))

SVM: 66.049805 (4.989095)
Lasso: 78.135708 (4.323603)
Ridge: 77.659568 (4.610512)
Linear: 77.758034 (4.615779)
Random Forests: 66.762903 (5.114323)


In [84]:
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

estimator = KerasRegressor(build_fn = neural_net, epochs = 1000, batch_size = 2000, verbose=0)
results = -1 * cross_val_score(estimator, X_train_m1, y_train_m1, cv = kfold, scoring = 'neg_mean_absolute_error')
print("Neural net: %f (%f) MAE" % (results.mean(), results.std()))

Neural net: 160.911656 (5.378138) MAE


<h4> Method 2

In [85]:
result_m2 =[]
names = []

for name, model in models:
    cv_score = -1 * cross_val_score(model, X_train_m2, y_train_m2, cv = kfold, scoring = 'neg_mean_absolute_error')
    result_m2 +=[cv_score]
    names += [name]
    print('%s: %f (%f)' % (name,cv_score.mean(), cv_score.std()))

SVM: 66.046219 (4.999490)
Lasso: 78.114655 (4.341362)
Ridge: 77.638274 (4.622443)
Linear: 77.740965 (4.629480)
Random Forests: 66.761026 (5.112784)


In [86]:
estimator = KerasRegressor(build_fn = neural_net, epochs = 1000, batch_size = 2000, verbose=0)
results = -1 * cross_val_score(estimator, X_train_m2, y_train_m2, cv = kfold, scoring = 'neg_mean_absolute_error')
print("Neural net: %f (%f) MAE" % (results.mean(), results.std()))

Neural net: 160.911656 (5.378138) MAE


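It seems that a linear SVM has the lowest mean MAE. Interestingly, the 2nd method of using strict values of +1, 0 or -1 for the sentiment gave marginally better results for every model except the neural network, whose score was identical under both methods. The improvement is far smaller than the fold-to-fold standard deviation, so the two methods are effectively tied.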
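
One rough way to judge how much the two methods differ is to compare the gap in mean MAE with the spread across folds. The numbers below are copied from the SVM output above; this is only a sanity check, since a formal comparison would need the paired per-fold scores, which are not printed here.

```python
# Mean and standard deviation of the 10-fold MAE for the linear SVM,
# copied from the cross-validation output above.
svm_m1_mean, svm_m1_std = 66.049805, 4.989095
svm_m2_mean, svm_m2_std = 66.046219, 4.999490

gap = svm_m1_mean - svm_m2_mean
# Express the gap as a fraction of the smaller fold-to-fold standard deviation.
gap_in_stds = gap / min(svm_m1_std, svm_m2_std)

print("gap between methods: %.6f" % gap)
print("gap / fold std:      %.6f" % gap_in_stds)
```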

<h3> Final Evaluation

So our final chosen model will be a linear SVM, using the 2nd method for sentiment.

In [87]:
model = SVR(kernel = 'linear').fit(X_train_m2, y_train_m2)

# obtaining predictions
predictions = model.predict(X_test_m2)

Evaluating our model

In [89]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
print("Final mean absolute error: %f      R^2 value: %f" %(mean_absolute_error(y_test_m2, predictions), r2_score(y_test_m2, predictions)))

Final mean absolute error: 70.950375      R^2 value: 0.062358


This is a very high MAE, and the R^2 of about 0.06 shows that the model fits our data very poorly.
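
The score computed above comes from `mean_absolute_error`, so it is an MAE in the same units as price, not an MSE. The sketch below uses made-up prices and predictions (not the notebook's test set) to show how MAE, MSE and R^2 are computed and how differently a single large miss affects each of them.

```python
import numpy as np

# Made-up prices and predictions, chosen only to show how the three metrics differ.
y_true = np.array([100.0, 150.0, 80.0, 400.0])
y_pred = np.array([120.0, 130.0, 90.0, 200.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # same units as price; each error counts linearly
mse = np.mean(errors ** 2)      # squared units; the single large miss dominates
# R^2 compares the squared error with that of always predicting the mean price.
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print("MAE: %.2f  MSE: %.2f  R^2: %.3f" % (mae, mse, r2))
```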

**NOTE**: To be quite honest, tweaking hyperparameters is something that I am still not entirely comfortable doing, so it will be skipped here for now. Will pick this notebook back up once I learn more. 
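
As a possible starting point for that later step, here is a minimal sketch of a grid search over a single hyperparameter using scikit-learn's `GridSearchCV`. The data, the choice of `Ridge`, and the `alpha` grid are all assumptions made for illustration; none of them were tuned or tested on the Airbnb data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic regression data for illustration only (not the Airbnb features).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=150)

# Hypothetical grid: every alpha is scored with the same k-fold split.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(
    Ridge(),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=1),
    scoring="neg_mean_absolute_error",
)
search.fit(X_demo, y_demo)

# best_score_ is the negated MAE, so flip the sign for reporting.
print(search.best_params_, -search.best_score_)
```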