# CIS5026 WRIT1 - How do words in listing names affect listing prices?

#### The aim of this project is to investigate whether the most commonly used words in listing names differ between the high and low ends of the price spectrum. For properties that are well reviewed, we could advise users of popular listing words in the luxury and budget ranges. We don't have listing reviews, so this would be future work. For now, this is an exploratory analysis of commonly used listing words, which we could also try to link to availability.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import re

## Pre-processing and exploratory data analysis

In [None]:
# Read in the data and check structure, column headers and data types
df = pd.read_csv('AB_NYC_2019.csv')
print(df.head())
# Print explicitly: a notebook cell only displays its last expression
print(df.dtypes)
print(df.shape)

In [None]:
# Check if null values
df.info()

In [None]:
# Some nulls, so count nulls per column
df.isnull().sum()

We aren't going to use either of the review columns in this analysis, so their nulls can stay. There's no way to impute a listing name, so we'll drop the rows with no name entry.

In [None]:
df.dropna(subset=['name'], inplace=True)
df.isnull().sum()

In [None]:
# Get the basic descriptive stats
df.describe()

There is huge variation in price and in the minimum nights for a booking. Interestingly, price has a minimum of 0 even though the column has no missing values, so let's look at those rows.

In [None]:
df.loc[df['price']== 0]

Nothing in these listings suggests that the stay is free, and some of them come from the same hosts, so we'll assume the zero prices are clerical errors. We'll remove these rows:

In [None]:
df = df[df.price != 0]
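The observation that several zero-price listings share a host can be checked with a count per host. The sketch below is illustrative only: it uses a small made-up frame rather than the real data, and it assumes the host identifier lives in a column called `host_id`.

```python
import pandas as pd

# Hypothetical toy data standing in for the listings frame (values are made up)
toy = pd.DataFrame({
    "host_id": [101, 101, 202, 303, 303, 303],
    "price":   [0,   0,   0,   120, 0,   85],
})

# Keep only the zero-price rows and count them per host
zero_price = toy[toy["price"] == 0]
zero_per_host = zero_price["host_id"].value_counts()

# Hosts with more than one zero-price listing
repeat_hosts = zero_per_host[zero_per_host > 1]
print(repeat_hosts)
```

On the real frame, the same count would need to run before the zero-price rows are dropped.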

Next let's look at the spread of properties across the different neighbourhood groups. 

In [None]:
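# Count listings per neighbourhood group; a bar chart suits a categorical column
df['neighbourhood_group'].value_counts().plot(kind='bar')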

Brooklyn and Manhattan have significantly more listings than other areas. The Bronx and especially Staten Island have little representation in comparison. Let's see how the price differs across neighbourhoods and room type.

In [None]:
df.groupby(['room_type','neighbourhood_group'])['price'].describe()

In [None]:
# And plot it to get a better look
plt.figure(figsize=(16, 7))
# Pass columns by keyword; recent Seaborn releases no longer accept x and y positionally
sns.barplot(x='neighbourhood_group', y='price', hue='room_type', data=df, palette="colorblind", ci=None)

Manhattan listings are significantly more expensive than those in the other neighbourhood groups, and there is not much price variation between the rest. Of the other four areas, it's interesting that Brooklyn has the least expensive shared rooms but the most expensive entire properties. It's possible that listings in Brooklyn, Queens and the Bronx get more expensive as they approach the Manhattan border, which could be future work to look into. Another avenue is the 365-day availability column: listings with low availability may be well priced and therefore get booked up, and vice versa.
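As a starting point for the availability idea, here is a minimal sketch of how price and availability could be compared. It runs on a small made-up frame rather than the real data, and it assumes the availability column is named `availability_365`. Availability is only a rough proxy for bookings, since a host can also block out dates.

```python
import pandas as pd

# Hypothetical toy data (values are made up for illustration)
toy = pd.DataFrame({
    "price":            [40, 55, 70, 90, 120, 150, 200, 400],
    "availability_365": [10, 30, 5,  60, 200, 180, 300, 365],
})

# Split listings into four equal-sized price bands (quartiles)
toy["price_band"] = pd.qcut(
    toy["price"], q=4, labels=["low", "mid-low", "mid-high", "high"]
)

# Median days available per price band
availability_by_band = (
    toy.groupby("price_band", observed=True)["availability_365"].median()
)
print(availability_by_band)

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rho = toy["price"].rank().corr(toy["availability_365"].rank())
print(round(rho, 3))
```

A rank correlation is used here because the price distribution is heavily skewed, so a handful of very expensive listings would otherwise dominate the result.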

In [None]:
# Top and bottom rows were cut off: Matplotlib 3.1.1 breaks Seaborn 0.9.0 heatmaps
# Manually setting ylim works around the issue on those versions
plt.figure(figsize=(10, 8))
plt.title("Variable Correlation Plot")
# Correlate numeric columns only; newer pandas raises on text columns in df.corr()
corr = df.select_dtypes(include='number').corr()
# Correlations range from -1 to 1, so set the colour scale to match
ax = sns.heatmap(data=corr, fmt='.2f', annot=True, cmap='magma', vmin=-1, vmax=1)
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
plt.show()

There don't seem to be many strong correlations, positive or negative, between variables that might be interesting to look into. The positive correlation between the number of reviews and reviews per month is expected: as the reviews per month increase, so does the total number of reviews. The same goes for the host and listing IDs; hosts are likely to retain ownership of a listing for multiple bookings, so the IDs will be associated.

## Data Analysis

#### We'll now explore the most commonly used words in listing names and analyse how they differ across price bands.

In [None]:
# Import the specific names used below so they can be called without the nltk. prefix
from nltk.corpus import stopwords
from nltk import word_tokenize
# First run may need nltk.download('punkt') and nltk.download('stopwords')
# Cast names to string and join with a space so words from adjacent listings don't merge
to_string = " ".join(str(i) for i in df['name'])
# Change everything to lower case
lower_text = to_string.lower()
# Handle punctuation and digits by replacing every non-letter with a space
just_letters = re.sub('[^a-z]', ' ', lower_text)
# Tokenize all words in name field
word_tokens = word_tokenize(just_letters)
# Handle stopwords e.g. to, an, the, etc. (a set gives fast membership tests)
sw = set(stopwords.words('english'))
# Keep words of 3+ characters that are not stopwords
wordlist = [token for token in word_tokens if len(token) > 2 and token not in sw]
# Store words in a series
words = pd.Series(wordlist)

In [None]:
common_words = words.value_counts()[:20]
common_words
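To compare words across price bands, one option is to split listings at the price quartiles and count words separately in the bottom and top bands. The sketch below shows the idea on a small made-up frame. The listing names, prices, the `top_words` helper and the tiny `STOPWORDS` set are all hypothetical stand-ins, used so the example runs without the dataset or the NLTK downloads.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical toy listings (names and prices are made up)
toy = pd.DataFrame({
    "name": [
        "Cozy room near subway",
        "Cozy private room",
        "Sunny studio in Brooklyn",
        "Quiet room by the park",
        "Modern apartment",
        "Spacious apartment with garden",
        "Luxury loft with skyline view",
        "Luxury penthouse, skyline terrace",
    ],
    "price": [40, 45, 60, 80, 100, 150, 300, 500],
})

# Tiny stand-in for the NLTK English stopword list
STOPWORDS = {"the", "with", "near"}

def top_words(names, n=2):
    """Return the n most common words of 3+ letters, excluding stopwords."""
    counts = Counter()
    for name in names:
        tokens = re.findall(r"[a-z]+", str(name).lower())
        counts.update(t for t in tokens if len(t) > 2 and t not in STOPWORDS)
    return counts.most_common(n)

# Budget band: bottom price quartile; luxury band: top price quartile
low_cut, high_cut = toy["price"].quantile([0.25, 0.75])
budget_top = top_words(toy.loc[toy["price"] <= low_cut, "name"])
luxury_top = top_words(toy.loc[toy["price"] >= high_cut, "name"])

print("Budget:", budget_top)
print("Luxury:", luxury_top)
```

Quartile cut-offs are only one choice. Because Manhattan prices are so much higher, computing the bands within each neighbourhood group may give a fairer comparison.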