# AI@Penn Venture Fellows Spring 2021 Data Challenge

## Instructions
As a part of your application to the AI@Penn Venture Fellows program, you are required to complete a data challenge. The data challenge is designed to understand your thought process when working with a data/ML oriented problem. You are given data on listings from Airbnb (found here: https://www.kaggle.com/kritikseth/us-airbnb-open-data) and your task is to create a 3-5 page presentation (to be submitted as a PDF) outlining your findings, analysis, and any recommendations. 
The topic and structure of your analysis is fully up to you. Potential areas for investigation are as follows:

1. Build a model to predict the price of an Airbnb listing
   - How accurate is your model? What characteristics drive a higher price?
2. Give a deeper insight into Airbnb listings
   - Exploratory analysis to understand the mix and characteristics of Airbnb listings

Once you have completed the challenge, please submit your code and presentation on the application form. This challenge is due along with the application on 12th February 2021.



## Introduction

Following these instructions, I will combine suggested areas 1 and 2: I will use exploratory analysis to identify the features with the strongest influence on price, then use those features to build a model that predicts price. Finally, I will look at the model's accuracy and write a brief summary.

### 1. Importing Libraries and Loading the Dataset
+ I'm working in a Kaggle notebook, which comes with the infrastructure to load datasets directly from within Kaggle.


In [None]:
import numpy as np
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import folium

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
data = pd.read_csv('../input/us-airbnb-open-data/AB_US_2020.csv')

### 2. Taking a basic view of the data
At face value, we have a number of columns: some contain features, while others are identifiers (host id, name, etc.). Since we are trying to identify features that correlate with price, I am going to remove all of these identifiers but one. We'll keep id to mark rows as distinct, but we don't need the others (host id, host name, name).

In [None]:
data.head()

In [None]:
# drop() returns a new DataFrame, so assign the result back
data = data.drop(columns=['host_id', 'host_name', 'name'])
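
One detail worth checking at this step: `DataFrame.drop` returns a new DataFrame and leaves the original untouched unless the result is assigned back. A minimal sketch on a made-up two-row frame (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy frame with the same identifier columns as the Airbnb data (values are made up)
toy = pd.DataFrame({
    'id': [1, 2],
    'host_id': [10, 20],
    'host_name': ['A', 'B'],
    'name': ['Cozy loft', 'Sunny room'],
    'price': [120, 80],
})

dropped = toy.drop(columns=['host_id', 'host_name', 'name'])

print(list(toy.columns))      # the original frame still has every column
print(list(dropped.columns))  # only the returned frame has the columns removed
```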


### 3. Separating features based on type

To be more systematic, we'll split our features into a numerical set and a categorical set, and look at each separately.

In [None]:
# .copy() gives independent frames, so adding columns later does not trigger SettingWithCopyWarning
categorical_data = data[['id', 'neighbourhood_group', 'room_type', 'city', 'price']].copy()
numerical_data = data[['id', 'latitude', 'longitude', 'neighbourhood', 'minimum_nights', 'number_of_reviews',
                       'last_review', 'reviews_per_month', 'calculated_host_listings_count',
                       'availability_365', 'price']].copy()
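
As a cross-check on a manual split like this one, pandas can separate columns by dtype with `select_dtypes`. The sketch below uses a small made-up frame; on the real data, text-like columns such as `neighbourhood` and `last_review` would likely land in the non-numeric group, but that is an assumption about how the CSV is parsed.

```python
import pandas as pd

# Made-up rows; column names mirror a few of the dataset's columns
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'room_type': ['Private room', 'Entire home/apt', 'Shared room'],
    'last_review': ['2020-01-05', None, '2019-11-30'],
    'minimum_nights': [1, 30, 2],
    'price': [60.0, 150.0, 35.0],
})

# Split column names by whether pandas stored them as numbers
numeric_cols = toy.select_dtypes(include='number').columns.tolist()
other_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(other_cols)
```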

### 4. Categorical Features

Let's look at our categorical features! We'll work with them first, since there are far fewer of them.

In [None]:
categorical_data.head()

#### 4.1 Room Type

Let's start with room types, since that seems simple. Intuitively, this should definitely have a reasonable impact on price, especially since the type of room strongly affects other factors such as space and comfort.

In the next few cells, I will do the following:

+ See if this column contains any null values, and deal with those if necessary.
+ Group the data by room_type and compute the average price per type
+ Visualize this relationship with a simple bar chart (here I will note that skyblue is one of the best colors in plt although no one asked)

In [None]:
categorical_data['room_type'].isnull().sum()

In [None]:
# Select 'price' before averaging so the text columns are not passed to mean()
room_type_ave = categorical_data.groupby('room_type', as_index=False)['price'].mean()
room_type_ave

In [None]:
plt.figure(figsize=(12,6))
# Newer seaborn releases require x and y as keyword arguments
sns.barplot(x='room_type', y='price', data=categorical_data, color='skyblue', edgecolor='steelblue')
plt.title('Price based on Room Type')

We can clearly see here, as predicted, that the type of room seems to have an effect on the price of that room.
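
Because a handful of very expensive listings can pull a group mean upward, it may be worth putting the median and the listing count next to the mean. A sketch on made-up prices (the 5000 entry is a deliberately extreme, hypothetical listing):

```python
import pandas as pd

toy = pd.DataFrame({
    'room_type': ['Entire home/apt'] * 4 + ['Private room'] * 3,
    'price': [100, 120, 110, 5000, 50, 60, 55],
})

# One row per room type: how many listings, their mean price, and their median price
summary = toy.groupby('room_type')['price'].agg(['count', 'mean', 'median'])
print(summary)
```

In this toy case the single extreme listing moves the mean for 'Entire home/apt' far above its median, while the two statistics agree for 'Private room'.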

#### 4.2 City

Now let's see if the city has a noticeable effect on price. Before that, let's check how many listings each city has. Then we can visualize both relationships.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 8))
# Left: number of listings per city
city_count = categorical_data.groupby('city', as_index=False).size()
city_count.plot.bar(x='city', y='size', ax=axes[0], ylabel='Size')
# Right: average price per city (select 'price' first so text columns are not averaged)
city_ave = categorical_data.groupby('city', as_index=False)['price'].mean()
city_ave.plot.bar(x='city', y='price', ax=axes[1], color='orange', ylabel='Price')

#### 4.3 Neighbourhood Group

Honestly this feature is pretty similar to the city of the listings, but it seems from a first view of the data that this column will present a problem in terms of null entries. So let's test this theory out

In [None]:
categorical_data['neighbourhood_group'].isnull().sum() / len(categorical_data['neighbourhood_group'])

It turns out we were right: over half the entries in our dataset don't contain a value for 'neighbourhood group'. We can still look at its relationship with price, but because it is so similar to the city feature (both represent some facet of location, with a city covering a larger area), we will likely use city rather than neighbourhood group in our model, since city has more values to train on.
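
The same null check can be run across every column at once, which makes it easy to spot other sparse features. A minimal sketch on made-up rows (taking the mean of a boolean mask gives the fraction of missing values):

```python
import numpy as np
import pandas as pd

# Made-up rows with some missing values
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', np.nan, np.nan, 'Brooklyn'],
    'city': ['New York City', 'Asheville', 'Austin', 'New York City'],
    'room_type': ['Private room', 'Entire home/apt', None, 'Shared room'],
})

# Fraction of missing values per column, most sparse first
null_fraction = toy.isnull().mean().sort_values(ascending=False)
print(null_fraction)
```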


In [None]:
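# Drop rows with no neighbourhood group, then average price per group
neighbourhood_group_dropna = categorical_data.dropna(subset=['neighbourhood_group'])\
    .groupby('neighbourhood_group', as_index=False)['price'].mean()
# Pass x explicitly so the bars are labelled with group names rather than row numbers
neighbourhood_group_dropna.plot(kind='bar', x='neighbourhood_group', y='price', figsize=(12,6))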

### 5. Numerical Features

Let's begin with a correlation matrix, to see at first glance whether there are any strong relationships between the numeric features and price.

In [None]:
numerical_data.head()

In [None]:
numerical_data.describe()

In [None]:
plt.figure(figsize=(12,6))
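# Restrict to numeric columns; text columns such as last_review cannot be correlated
sns.heatmap(numerical_data.select_dtypes(include='number').corr(), cmap='YlGnBu')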


It doesn't seem that there are any features that stick out immediately in terms of their relationship with price, which is fine. We will look at these factors individually anyway
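
A heatmap can be hard to read for a single target. One option is to pull out just the `price` column of the correlation matrix and sort it. Spearman rank correlation is shown alongside Pearson here because it is less sensitive to extreme prices. The data below is synthetic (generated with a fixed seed, with made-up coefficients); only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'availability_365': rng.integers(0, 366, size=n),
    'number_of_reviews': rng.integers(0, 300, size=n),
})
# Synthetic price: rises with availability, falls with reviews, plus noise
toy['price'] = (50 + 0.3 * toy['availability_365']
                - 0.2 * toy['number_of_reviews']
                + rng.normal(0, 10, size=n))

# Keep only each feature's correlation with price
pearson = toy.corr(method='pearson')['price'].drop('price')
spearman = toy.corr(method='spearman')['price'].drop('price')

print(pd.DataFrame({'pearson': pearson, 'spearman': spearman}).sort_values('pearson'))
```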

#### 5.1 Minimum Nights
Let's start with minimum nights. First we'll look at the spread of the data. Then I'll group the values of the minimum nights column into 10 bins to make visualization easier.

In [None]:
# Number of listings for each distinct minimum_nights value
min_nights_spread = numerical_data.groupby('minimum_nights', as_index=False).size()
sns.scatterplot(x='minimum_nights', y='size', data=min_nights_spread)

In [None]:
#removing outlier 
drop_set = numerical_data[numerical_data['minimum_nights'] > 10**7]
numerical_data = numerical_data.drop(drop_set.index)

In [None]:
# Build ~10 bins from the deciles of the distinct minimum_nights values,
# then assign every listing to a bin based on its own minimum_nights value
distinct_nights = pd.Series(np.sort(numerical_data['minimum_nights'].unique()))
_, bin_edges = pd.qcut(distinct_nights, q=10, retbins=True)
# Round edges to whole nights; np.unique removes any edges that collapse together
bin_edges = np.unique(np.round(bin_edges).astype(int))
numerical_data['min_nights_ave_bins'] = pd.cut(numerical_data['minimum_nights'], bins=bin_edges, include_lowest=True)
numerical_data

#numerical_data.groupby('min_nights_ave_bins', as_index=False).mean()

In [None]:
plt.figure(figsize=(12,6))
ax = sns.barplot(x='min_nights_ave_bins', y='price', data=numerical_data, color='palegreen', edgecolor='green')
#ax.set_xticklabels(['<18','<35','<52','<83','<104','<153', '<200','<300','<396','<100000000'])
ax.set_xlabel('Minimum Nights')

There does not seem to be a noticeable trend/ relationship in the data, but we can see that rooms which have minimum nights between ~30 and ~50 days have a higher price on average.
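
On the binning step above: `pd.cut` splits the value range into equal-width bins, while `pd.qcut` puts roughly the same number of rows into each bin, which matters for a skewed column like minimum nights. A sketch with made-up values:

```python
import pandas as pd

# Made-up, heavily skewed minimum-night values
nights = pd.Series([1, 1, 2, 2, 3, 7, 30, 365], name='minimum_nights')

equal_width = pd.cut(nights, bins=4)                    # 4 bins of equal width
equal_count = pd.qcut(nights, q=4, duplicates='drop')   # 4 bins of roughly equal size

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With equal-width bins, nearly every value falls into the first bin; with quantile bins, the rows are spread evenly.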


#### 5.2 Latitude/Longitude (and neighbourhood)

Let's look at how location, in terms of latitude and longitude, affects price. Then we will look at how neighbourhoods compare to latitude/longitude in terms of their relationship with price.

In [None]:

from sklearn.cluster import KMeans

km = KMeans().fit(numerical_data[['latitude', 'longitude']])
km.cluster_centers_

our_map = folium.Map([42, -110], zoom_start=4)

# Draw one circle per cluster centre; the popup shows its rounded coordinates and listing count
for i, (lat, lon) in enumerate(km.cluster_centers_):
    total = int((km.labels_ == i).sum())
    folium.CircleMarker([lat, lon],
                        popup=f'({round(lat)}, {round(lon)}) : {total} listings here',
                        radius=15, fill_color='blue').add_to(our_map)

our_map

We can definitely see discernible clusters of listings from our data through this visualization.
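
To connect these clusters back to price, one possible follow-up is to attach each listing's cluster label to the data and compare average prices per cluster. The sketch below uses two synthetic "cities" with made-up coordinates and prices, and fixes `n_clusters=2` for the toy case; it is not the clustering run above.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of listings, around (40.7, -74.0) and (34.0, -118.2)
toy = pd.DataFrame({
    'latitude': np.concatenate([rng.normal(40.7, 0.05, 50), rng.normal(34.0, 0.05, 50)]),
    'longitude': np.concatenate([rng.normal(-74.0, 0.05, 50), rng.normal(-118.2, 0.05, 50)]),
    'price': np.concatenate([rng.normal(200, 20, 50), rng.normal(120, 20, 50)]),
})

# Cluster on coordinates only, then label each listing with its cluster
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy[['latitude', 'longitude']])
toy['cluster'] = km.labels_

# Listing count and average price per cluster
cluster_summary = toy.groupby('cluster')['price'].agg(['count', 'mean'])
print(cluster_summary)
```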

#### 5.3 Reviews Per Month
How do reviews per month affect the average price?

In [None]:
plt.figure(figsize=(10,6))
reviews_per_m = numerical_data.groupby('reviews_per_month', as_index=False)['price'].mean()
sns.scatterplot(x='reviews_per_month', y='price', data=reviews_per_m)

There also doesn't seem to be any discernible relationship between the number of reviews per month, and price.

#### 5.4 Number of Reviews
Now that we've seen the number of reviews per month, we'll look at the total number of reviews next.

In [None]:
plt.figure(figsize=(10,6))
reviews_total = numerical_data.groupby('number_of_reviews', as_index=False)['price'].mean()
sns.regplot(x='number_of_reviews', y='price', data=reviews_total, color='skyblue', line_kws={"color": "blue"})

There seems to be a general negative correlation between the total number of reviews and price! It is unclear whether one factor drives the other, or whether both are driven by something else. We'll keep this feature.
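
One caveat: the plot above is fitted to per-value group means rather than to individual listings, so each point carries equal weight no matter how many listings it represents, and averaging tends to remove listing-level noise and make a trend look stronger. A sketch of checking the correlation at both levels, on synthetic data with a made-up slope and noise level:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 2000
reviews = rng.integers(0, 50, size=n)
# Synthetic price: a mild downward trend buried in listing-level noise
price = 150 - 1.0 * reviews + rng.normal(0, 40, size=n)
toy = pd.DataFrame({'number_of_reviews': reviews, 'price': price})

# Correlation over individual listings
row_corr = np.corrcoef(toy['number_of_reviews'], toy['price'])[0, 1]

# Correlation over per-value group means, as in the regplot
means = toy.groupby('number_of_reviews', as_index=False)['price'].mean()
group_corr = np.corrcoef(means['number_of_reviews'], means['price'])[0, 1]

print(round(row_corr, 2), round(group_corr, 2))
```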

#### 5.5 Availability
Next, we'll look at availability

In [None]:
plt.figure(figsize=(10,6))
availability = numerical_data.groupby('availability_365', as_index=False)['price'].mean()
sns.regplot(x='availability_365', y='price', data=availability, color='skyblue', line_kws={"color": "blue"})


We seem to have a positive correlation here, the opposite of the previous feature: higher availability seems to go with a higher price. We will definitely keep this feature.

#### 5.6 Calculated Host Listings Count 
Our final feature

In [None]:
calc_host_listings = numerical_data.groupby('calculated_host_listings_count', as_index=False)['price'].mean()
sns.regplot(x='calculated_host_listings_count', y='price', data=calc_host_listings, color='skyblue', line_kws={"color": "blue"})


There doesn't seem to be a noticeable relationship here, so we'll leave this feature out.

### 6. Building a model

I'll be using a linear regression model for this. In the following cells, I will:
+ Encode the categorical data
+ Split the data into train/test sets
+ Fit a linear regressor on the training data
+ Predict y_test values
+ Calculate the model's error (RMSE)

In [None]:
# .copy() so the encoded columns added below go onto an independent frame
features = data[['room_type', 'minimum_nights', 'number_of_reviews', 'city', 'availability_365', 'latitude', 'longitude']].copy()
y = data[['price']]

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder


In [None]:
l = LabelEncoder()
features['room_type_enc'] = l.fit_transform(features['room_type'])
features['city_enc'] = l.fit_transform(features['city'])
features = features.drop(columns=['city', 'room_type'])
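
`LabelEncoder` maps each category to an arbitrary integer, which a linear model will read as an ordered scale. One common alternative, shown here only as a sketch and not what the cells above use, is one-hot encoding with `pd.get_dummies`. The rows are made up; `drop_first=True` drops one category per feature to avoid redundant columns.

```python
import pandas as pd

toy = pd.DataFrame({
    'room_type': ['Private room', 'Entire home/apt', 'Shared room', 'Private room'],
    'minimum_nights': [1, 30, 2, 3],
})

# One 0/1 column per room type, minus the first category (alphabetically)
encoded = pd.get_dummies(toy, columns=['room_type'], drop_first=True, dtype=int)
print(encoded)
```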


In [None]:
x_train, x_test, y_train, y_test = train_test_split(features, y, test_size = 0.4, 
                                                    random_state = 2)


In [None]:
regressor = LinearRegression()
regressor.fit(x_train, y_train)

In [None]:
y_pred = regressor.predict(x_test)

In [None]:
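# Side-by-side view of actual and predicted prices
act_v_pred = pd.DataFrame(y_test).reset_index()
# predict() returns an (n, 1) array here because y is a DataFrame; flatten it to one column
act_v_pred['Predicted'] = y_pred.ravel()
act_v_pred.head(10)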

In [None]:
# Root mean squared error (RMSE), in the same units as price
np.sqrt(mean_squared_error(y_test, y_pred))
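
An RMSE on its own is hard to judge. One way to put it in context is to compare it against a baseline that always predicts the mean training price, and to report R² as well. The sketch below runs on synthetic data; the two features, the coefficients, and the `rmse` helper are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Two synthetic features and a price that depends linearly on them, plus noise
X = rng.uniform(0, 365, size=(400, 2))
y = 80 + 0.4 * X[:, 0] - 0.1 * X[:, 1] + rng.normal(0, 25, size=400)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=2)

model = LinearRegression().fit(X_train, y_train)
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)  # always predicts the training mean


def rmse(y_true, y_hat):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))


model_rmse = rmse(y_test, model.predict(X_test))
baseline_rmse = rmse(y_test, baseline.predict(X_test))
model_r2 = r2_score(y_test, model.predict(X_test))

print(model_rmse, baseline_rmse, model_r2)
```

If the model's RMSE is not clearly below the baseline's, the features are adding little beyond the average price.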