## 1. Getting Started: Airbnb Copenhagen

This assignment deals with the most recent Airbnb listings in Copenhagen. The data is collected from [Inside Airbnb](http://insideairbnb.com/copenhagen). Feel free to explore the website further in order to better understand the data. The data (*listings.csv*) has been collected as raw data and needs to be preprocessed.

**Hand-in:** Hand in as a group in Itslearning in a **single**, well-organized and easy-to-read Jupyter Notebook. If your group consists of students from different classes, upload in **both** classes.

1. First we need to remove all the redundant columns. Please keep the following 22 columns and remove all others:

    id  
    name  
    host_id  
    host_name  
    neighbourhood_cleansed  
    latitude  
    longitude  
    room_type  
    price  
    minimum_nights  
    number_of_reviews  
    last_review  
    review_scores_rating  
    review_scores_accuracy  
    review_scores_cleanliness  
    review_scores_checkin  
    review_scores_communication  
    review_scores_location  
    review_scores_value  
    reviews_per_month  
    calculated_host_listings_count  
    availability_365



2. Next we have to handle missing values. Remove all rows where `number_of_reviews = 0`. If there are still missing values, remove the rows that contain them so you have a data set with no missing values.

3. Fix the `neighbourhood_cleansed` values (some are missing 'æ ø å'), and if necessary change the price to DKK.

4. Create a fitting word cloud based on the `name` column. Feel free to remove non-descriptive stop words (e.g. since this is about Copenhagen, perhaps the word 'Copenhagen' is redundant).

5. Since data science is so much fun, provide a word cloud of the names of the hosts, removing any names of non-persons. Does this more or less correspond with the distribution of names according to [Danmarks Statistik](https://www.dst.dk/da/Statistik/emner/borgere/navne/navne-i-hele-befolkningen)?

6. Create a new column using bins of price. Use 11 bins, evenly distributed but with the last bin $> 10,000$.

7. Using non-scaled versions of latitude and longitude, plot the listings data on a map. Use the newly created price bins as a color parameter. Also, create a plot (i.e. another plot) where you group the listings with regard to the neighbourhood.

8. Create boxplots where you have the neighbourhood on the x-axis and price on the y-axis. What does this tell you about the listings in Copenhagen? Keep the x-axis as is and move different variables into the y-axis to see how things are distributed between the neighborhoods to create different plots (your choice).

9. Create a bar chart of the ten hosts with the most listings. Place host id on the x-axis and the count of listings on the y-axis.

10. Do a descriptive analysis of the neighborhoods. Include information about room type in the analysis as well as one other self-chosen feature. The descriptive analysis should contain mean/average, mode, median, standard deviation/variance, minimum, maximum and quartiles.

11. Supply a list of the top 10 highest rated listings and visualize them on a map.

12. Now, use any preprocessing and feature engineering steps that you find relevant before proceeding (optional).

13. Create another new column, where the price is divided into two categories: "expensive" listings defined by all listings with a price higher than the median price, and "affordable" listings defined by all listings with a price equal to or below the median price. You can encode the affordable listings as "0" and the expensive ones as "1". All listings should now have a classification indicating either expensive listings (1) or affordable listings (0).

14. Based on self-chosen features, develop a Naïve Bayes and k-Nearest Neighbor model to determine whether a rental property should be classified as 0 or 1. Remember to divide your data into training data and test data. Comment on your findings.

15. Try to come up with a final conclusion to the Airbnb-Copenhagen assignment.


Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

First we need to remove all the redundant columns. Please keep the 22 columns listed above and remove all others.


Next we have to handle missing values. Remove all rows where `number_of_reviews = 0`. If there are still missing values, remove the rows that contain them so you have a data set with no missing values.

Fix the `neighbourhood_cleansed` values (some are missing 'æ ø å'), and if necessary change the price to DKK.



In [None]:
df = pd.read_csv('listings.csv')
# Keep only the 22 columns requested in the assignment
df = df[[
    'id', 'name', 'host_id', 'host_name', 'neighbourhood_cleansed', 'latitude', 'longitude',
    'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review',
    'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
    'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
    'review_scores_value', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365',
]]
# Drop listings without reviews, then any remaining rows with missing values
df = df[df['number_of_reviews'] != 0]
df = df.dropna().copy()
# Restore the Danish letters. The garbled spellings on the left are assumed;
# verify them against df['neighbourhood_cleansed'].unique() before relying on this mapping.
neighbourhood_fixes = {'Nrrebro': 'Nørrebro', 'sterbro': 'Østerbro', 'Amager st': 'Amager Øst',
                       'Vanlse': 'Vanløse', 'Brnshj-Husum': 'Brønshøj-Husum'}
df['neighbourhood_cleansed'] = df['neighbourhood_cleansed'].replace(neighbourhood_fixes)
# Strip the currency symbol and thousands separators, then convert to a number
df['price'] = df['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
# Prices are assumed to be listed in local currency (DKK) already, so no conversion is applied
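
As a small illustration of the price clean-up, the sketch below parses a few made-up price strings in the `$1,200.00` style. The sample values and the helper name `parse_price` are assumptions for illustration and are not taken from *listings.csv*. Passing `regex=False` makes it explicit that `$` is meant as a literal character rather than a regular-expression anchor.

```python
import pandas as pd


def parse_price(prices: pd.Series) -> pd.Series:
    """Turn strings such as '$1,200.00' into floats (hypothetical helper)."""
    cleaned = (
        prices.astype(str)
        .str.replace("$", "", regex=False)  # literal dollar sign
        .str.replace(",", "", regex=False)  # thousands separator
    )
    # Values that cannot be parsed become NaN instead of raising an error
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up sample values, not real listings
sample = pd.Series(["$850.00", "$1,200.00", "$10,500.00", "n/a"])
parsed = parse_price(sample)
print(parsed.tolist())
```

Using `pd.to_numeric(..., errors="coerce")` instead of `astype(float)` is a design choice: malformed entries show up as missing values that can be inspected, rather than stopping the notebook.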

Create a fitting word cloud based on the `name` column. Feel free to remove non-descriptive stop words (e.g. since this is about Copenhagen, perhaps the word 'Copenhagen' is redundant).

In [None]:
from wordcloud import WordCloud, STOPWORDS

# Start from the built-in English stop words and add words that carry no information here
stopwords = set(STOPWORDS)
stopwords.update(['copenhagen', 'og', 'fra', 'ude', 'N'])

# Join every listing name into one string; str(df['name']) would only yield
# the truncated Series preview, including metadata such as 'Name' and 'dtype'
text = ' '.join(df['name'].astype(str))

wordcloud = WordCloud(
    background_color='white',
    stopwords=stopwords,
    max_words=200,
    max_font_size=200,
    random_state=42
).generate(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
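
The text passed to the word cloud matters. The sketch below uses made-up listing names (an assumption, not real data) to show the difference between converting a Series with `str(...)`, which gives only the printed preview, and joining its values into one string.

```python
import pandas as pd

# Made-up listing names, long enough for pandas to truncate the printed preview
names = pd.Series([f"Cozy flat {i}" for i in range(100)], name="name")

preview = str(names)                     # the printed preview of the Series
full_text = " ".join(names.astype(str))  # every name, separated by spaces

print("dtype" in preview)           # metadata is part of the preview
print("Cozy flat 50" in preview)    # rows in the middle are cut from the preview
print("Cozy flat 50" in full_text)  # the joined text keeps every row
```

With the joined text, words such as `dtype` or `Name` never reach the word cloud, so they do not need to be listed as stop words.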

Since data science is so much fun, provide a word cloud of the names of the hosts, removing any names of non-persons. Does this more or less correspond with the distribution of names according to [Danmarks Statistik](https://www.dst.dk/da/Statistik/emner/borgere/navne/navne-i-hele-befolkningen)?

In [None]:
# Names of non-persons (e.g. companies) can be added to this set after inspecting df['host_name']
stopwords = set(STOPWORDS)

# Join every host name into one string instead of using the truncated Series preview
text = ' '.join(df['host_name'].astype(str))

wordcloud = WordCloud(
    background_color='white',
    stopwords=stopwords,
    max_words=200,
    max_font_size=200,
    random_state=42
).generate(text)
plt.figure()
plt.imshow(wordcloud)
plt.axis('off')
plt.show()


Create a new column using bins of price. Use 11 bins, evenly distributed but with the last bin $> 10,000$.

In [None]:
bin_edges = list(range(0, 10001, 1000)) + [float('inf')]
bin_labels = ['0-1000', '1000-2000', '2000-3000', '3000-4000', '4000-5000', '5000-6000', '6000-7000', '7000-8000', '8000-9000', '9000-10000', '10000+']
df['price_bins'] = pd.cut(df['price'], bins=bin_edges, labels=bin_labels)
value_counts = df['price_bins'].value_counts().loc[bin_labels]
value_counts.plot(kind='bar', figsize=(10, 6), title='Price distribution')
plt.xlabel('Price')
plt.ylabel('Count')
plt.show()
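
To make the edge behaviour of the bins concrete, here is a minimal sketch on a handful of made-up prices (the values are assumptions chosen to sit on or near the bin edges). By default `pd.cut` uses right-closed intervals, so a price of exactly 1000 falls in `0-1000`, and a price of 0 falls outside the first interval unless `include_lowest=True` is passed.

```python
import pandas as pd

# Ten bins of width 1000 plus one open-ended bin above 10,000 (11 bins in total)
edges = list(range(0, 10001, 1000)) + [float("inf")]
labels = [f"{lo}-{lo + 1000}" for lo in range(0, 10000, 1000)] + ["10000+"]

# Made-up prices on or next to the bin edges
prices = pd.Series([0, 1, 1000, 1001, 10000, 10001, 25000])

binned = pd.cut(prices, bins=edges, labels=labels)
binned_incl = pd.cut(prices, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"price": prices, "default": binned, "include_lowest": binned_incl}))
```

Whether a price of 0 should count as a valid listing is a judgement call; the sketch only shows how each choice is expressed.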

In [None]:
# Show the distribution of ln(price)
df['ln_price'] = np.log(df['price'])
df['ln_price'].hist(bins=11)
plt.xlabel('ln(price)')
plt.ylabel('Count')
plt.show()

In [None]:
bin_edges = list(range(0, 11, 1)) + [float('inf')]
bin_labels = ['0-1', '1-2', '2-3', '3-4', '4-5', '5-6', '6-7','7-8','8-9','9-10','10+']
df['ln_price_bins'] = pd.cut(df['ln_price'], bins=bin_edges, labels=bin_labels)

Using non-scaled versions of latitude and longitude, plot the listings data on a map. Use the newly created price bins as a color parameter.

In [None]:
import folium

df_map = df[['latitude', 'longitude', 'ln_price']].dropna().reset_index(drop=True)

m = folium.Map(location=[55.6761, 12.5683], zoom_start=11)

def gradient_rgb_color(price, min_price, max_price):
    normalized = (price - min_price) / (max_price - min_price)
    red = int(255 * normalized)
    green = int(255 * (1 - normalized))
    blue = 0
    
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)


min_price = df_map['ln_price'].min()
max_price = df_map['ln_price'].max()

# Draw one circle per listing, colored by its position between the lowest and highest ln(price)
for row in df_map.itertuples(index=False):
    color = gradient_rgb_color(row.ln_price, min_price, max_price)
    folium.Circle(
        location=[row.latitude, row.longitude],
        popup=str(row.ln_price),
        radius=10,
        color=color,
        fill=True,
        fill_color=color
    ).add_to(m)
m
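
The task asks for the price bins as the color parameter. One possible way, sketched below under assumptions, is to give each bin label a fixed color and look it up per listing. The helper `bin_color`, the green-to-red scheme, and the demo rows are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Same 11 labels as the price bins: ten bins of width 1000 plus '10000+'
labels = [f"{lo}-{lo + 1000}" for lo in range(0, 10000, 1000)] + ["10000+"]


def bin_color(position: int, n_bins: int) -> str:
    """Green for the cheapest bin, red for the most expensive (hypothetical helper)."""
    share = position / (n_bins - 1) if n_bins > 1 else 0.0
    red = int(255 * share)
    green = int(255 * (1 - share))
    return "#{:02x}{:02x}00".format(red, green)


# One fixed color per bin label
color_by_bin = {label: bin_color(i, len(labels)) for i, label in enumerate(labels)}

# Made-up rows standing in for listings that already have a price bin
demo = pd.DataFrame({"price_bins": ["0-1000", "1000-2000", "10000+"]})
demo["color"] = demo["price_bins"].astype(str).map(color_by_bin)
print(demo)
```

The resulting `color` column could then be passed as `color` and `fill_color` when drawing each circle, so that all listings in the same bin share one color.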

Also, create a plot (i.e. another plot) where you group the listings with regard to the neighbourhood.

In [None]:
df['neighbourhood_cleansed'].value_counts().plot(kind='bar', figsize=(10, 6), title='Neighbourhood distribution')
plt.xlabel('Neighbourhood')
plt.ylabel('Count')
plt.show()


Create boxplots where you have the neighbourhood on the x-axis and price on the y-axis. What does this tell you about the listings in Copenhagen? Keep the x-axis as is and move different variables into the y-axis to see how things are distributed between the neighborhoods to create different plots (your choice).

In [None]:

boxplot_neighbourhood_price = sns.boxplot(x='neighbourhood_cleansed', y='price', data=df)
boxplot_neighbourhood_price.set_xticklabels(boxplot_neighbourhood_price.get_xticklabels(), rotation=90)
plt.show()


Create a bar chart of the ten hosts with the most listings. Place host id on the x-axis and the count of listings on the y-axis.

In [None]:
barplot_host_id = sns.countplot(x='host_id', data=df, order=df['host_id'].value_counts().iloc[:10].index)
barplot_host_id.set_xticklabels(barplot_host_id.get_xticklabels(), rotation=90)
barplot_host_id.set(xlabel='Host ID', ylabel='Count of Listings')
plt.show()

In [None]:
df_calculated_host_listings_count = df.sort_values(by=['calculated_host_listings_count'], ascending=False)
df_calculated_host_listings_count = df_calculated_host_listings_count.drop_duplicates(subset='host_id', keep='first')
df_calculated_host_listings_count = df_calculated_host_listings_count.head(10)
df_calculated_host_listings_count = df_calculated_host_listings_count[['host_id', 'calculated_host_listings_count']]

barplot_calculated_host_listings_count = sns.barplot(x='host_id', y='calculated_host_listings_count', data=df_calculated_host_listings_count,
                                                    order=df_calculated_host_listings_count.sort_values(by='calculated_host_listings_count', ascending=False)['host_id'])
barplot_calculated_host_listings_count.set_xticklabels(barplot_calculated_host_listings_count.get_xticklabels(), rotation=90)
barplot_calculated_host_listings_count.set(xlabel='Host ID', ylabel='Count of Listings')
plt.show()

Do a descriptive analysis of the neighborhoods. Include information about room type in the analysis as well as one other self-chosen feature. The descriptive analysis should contain mean/average, mode, median, standard deviation/variance, minimum, maximum and quartiles.

In [None]:
neighbourhood = df.groupby('neighbourhood_cleansed')
neighbourhood['room_type'].describe()

In [None]:
neighbourhood['price'].describe()

In [None]:
neighbourhood['minimum_nights'].describe()

In [None]:
print(neighbourhood['price'].quantile([0.25, 0.5, 0.75]))
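
`describe()` covers the mean, standard deviation, minimum, maximum and quartiles, but not the mode or the variance that the task also lists. A minimal sketch of collecting all of them in one table, on a made-up data set (the rows and the helper names `mode_first`, `q25` and `q75` are assumptions for illustration):

```python
import pandas as pd

# Made-up listings in two neighbourhoods
demo = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "A", "B", "B", "B"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room", "Entire home/apt"],
    "price": [500, 700, 700, 1100, 300, 300, 900],
})


def mode_first(values: pd.Series):
    """Most frequent value; if several tie, the first one in sorted order."""
    return values.mode().iloc[0]


def q25(values: pd.Series) -> float:
    return values.quantile(0.25)


def q75(values: pd.Series) -> float:
    return values.quantile(0.75)


grouped = demo.groupby("neighbourhood_cleansed")

# Numeric feature: all requested statistics side by side
price_stats = grouped["price"].agg(
    mean="mean", median="median", mode=mode_first, std="std", var="var",
    min="min", max="max", q25=q25, q75=q75,
)
# Categorical feature: only the mode is meaningful
room_type_mode = grouped["room_type"].agg(mode_first)

print(price_stats)
print(room_type_mode)
```

For a categorical column such as `room_type`, the mean and median are not defined, so the mode (together with the counts from `describe()`) carries the information.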

Supply a list of the top 10 highest rated listings and visualize them on a map.

In [None]:
from folium.plugins import MarkerCluster

# Ten listings with the highest overall rating (ties are broken arbitrarily)
df_top10_highest = (
    df[['latitude', 'longitude', 'price', 'price_bins', 'review_scores_rating']]
    .dropna()
    .sort_values(by='review_scores_rating', ascending=False)
    .head(10)
)
print(df_top10_highest)

m = folium.Map(location=[55.6761, 12.5683], zoom_start=11)
marker_cluster = MarkerCluster().add_to(m)
# The original cell called categorize(), which is not defined anywhere in this notebook;
# a single fixed color is used here as a stand-in
color = 'blue'
for _, row in df_top10_highest.iterrows():
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=5,
        color=color,
        fill=True,
        fill_color=color,
        popup=f"Price: {row['price']} - Bin: {row['price_bins']}"
    ).add_to(marker_cluster)
m
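
Many listings can share the same top rating, in which case the "top 10" depends on how ties are ordered. One possible tie-breaker, shown here as a sketch on made-up rows, is to prefer listings with more reviews. Using `number_of_reviews` for this is an assumption, not something the assignment prescribes.

```python
import pandas as pd

# Made-up listings: three of them share the top rating
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "review_scores_rating": [5.0, 5.0, 4.9, 5.0],
    "number_of_reviews": [3, 120, 400, 45],
})

# Sort by rating first, then by number of reviews, both descending
top = demo.sort_values(
    by=["review_scores_rating", "number_of_reviews"],
    ascending=[False, False],
).head(3)

print(top["id"].tolist())
```

A rating backed by many reviews is arguably more reliable than the same rating from a handful, which is the reasoning behind this ordering.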

Now, use any preprocessing and feature engineering steps that you find relevant before proceeding (optional).

In [None]:
df_preprocessing = df[['neighbourhood_cleansed', 'room_type', 'price', 'minimum_nights']]
df_preprocessing = pd.concat([df_preprocessing, pd.get_dummies(df_preprocessing['room_type'])], axis=1)
df_preprocessing = pd.concat([df_preprocessing, pd.get_dummies(df_preprocessing['neighbourhood_cleansed'])], axis=1)
df_preprocessing = df_preprocessing.drop(['neighbourhood_cleansed', 'room_type'], axis=1)

Create another new column, where the price is divided into two categories: "expensive" listings defined by all listings with a price higher than the median price, and "affordable" listings defined by all listings with a price equal to or below the median price. You can encode the affordable listings as "0" and the expensive ones as "1". All listings should now have a classification indicating either expensive listings (1) or affordable listings (0).

In [None]:
df_preprocessing['expensive'] = np.where(df_preprocessing['price'] > df_preprocessing['price'].median(), 1, 0)
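
A small worked example of the median split, on made-up prices (an assumption for illustration): listings priced exactly at the median count as affordable, so the two classes are not necessarily the same size.

```python
import numpy as np
import pandas as pd

# Made-up prices with two values exactly at the median
prices = pd.Series([400, 800, 800, 1200, 3000])
median_price = prices.median()

# 1 = expensive (strictly above the median), 0 = affordable (at or below it)
expensive = np.where(prices > median_price, 1, 0)

print(median_price)
print(expensive.tolist())
```

Checking the class balance with `value_counts()` after the split is a quick way to see how many listings sit exactly on the median.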

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

Based on self-chosen features, develop a Naïve Bayes and k-Nearest Neighbor model to determine whether a rental property should be classified as 0 or 1. Remember to divide your data into training data and test data. Comment on your findings.

In [None]:
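y = df_preprocessing['expensive']
# Drop price as well: the target is derived from it, so keeping it as a feature would leak the answer
X = df_preprocessing.drop(['expensive', 'price'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)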

In [None]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
# AUC is computed from the predicted probability of class 1, not from the hard 0/1 labels
y_proba = gnb.predict_proba(X_test)[:, 1]
print('Accuracy score for Naive Bayes: ', accuracy_score(y_test, y_pred))
print('AUC score for Naive Bayes: ', roc_auc_score(y_test, y_proba))
scores = cross_val_score(gnb, X, y, cv=5, scoring='roc_auc')
print('Cross validation scores for Naive Bayes: ', scores)
print('Mean cross validation score for Naive Bayes: ', scores.mean())

In [None]:
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
# AUC is computed from the predicted probability of class 1, not from the hard 0/1 labels
y_proba = knn.predict_proba(X_test)[:, 1]
print('Accuracy score for KNN: ', accuracy_score(y_test, y_pred))
print('AUC score for KNN: ', roc_auc_score(y_test, y_proba))
scores = cross_val_score(knn, X, y, cv=5, scoring='roc_auc')
print('Cross validation scores for KNN: ', scores)
print('Mean cross validation score for KNN: ', scores.mean())
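
k-Nearest Neighbor relies on distances, so a feature on a large scale (such as `minimum_nights`) can dominate 0/1 dummy columns. A common remedy is to standardize the features inside a pipeline, so that the scaler is fitted on the training data only. The sketch below uses synthetic data; the feature names and the way the label is generated are assumptions for illustration and say nothing about the real listings.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 400

# Synthetic features: one on a large scale, one 0/1 dummy
minimum_nights = rng.integers(1, 30, size=n).astype(float)
is_entire_home = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([minimum_nights, is_entire_home])

# Synthetic label driven by the dummy feature plus some noise
y = (is_entire_home + rng.normal(0, 0.3, size=n) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The scaler is fitted on the training data only, then applied to the test data
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X_train, y_train)

# Probability of class 1, used for the AUC
proba = knn_scaled.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, proba)
print(f"AUC with scaling: {auc:.3f}")
```

The same pipeline object can be passed to `cross_val_score`, which refits the scaler within each fold and so avoids information from the validation fold influencing the scaling.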

In [None]:
#try XGBoost classifier
import xgboost as xgb

xgb_model = xgb.XGBClassifier(random_state=42)
xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
# AUC is computed from the predicted probability of class 1, not from the hard 0/1 labels
y_proba = xgb_model.predict_proba(X_test)[:, 1]
print('Accuracy score for XGBoost: ', accuracy_score(y_test, y_pred))
print('AUC score for XGBoost: ', roc_auc_score(y_test, y_proba))
scores = cross_val_score(xgb_model, X, y, cv=5, scoring='roc_auc')
print('Cross validation scores for XGBoost: ', scores)
print('Mean cross validation score for XGBoost: ', scores.mean())

Try to come up with a final conclusion to the Airbnb-Copenhagen assignment.