Import the necessary libraries

In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns

Read in the data to be analyzed

In [53]:
df_detailed_listings = pd.read_csv('./resource/detailed_listings.csv')

In [54]:
# df_listing_cleaned = df_detailed_listings.drop(columns=['listing_url', 'scrape_id', 'last_scraped', 'source', 'picture_url', 'host_url', \
#     'host_thumbnail_url', 'host_picture_url', 'latitude', 'longitude', 'calendar_updated', 'calendar_last_scraped', 'license'], 
#     axis=1).copy()
# df_listing_cleaned = df_listing_cleaned.dropna(axis=1, how='all')
# df_listing_cleaned['price'] = df_listing_cleaned.price.str[1:].str.replace(',','').str.split('.').str[0].astype(int)

Here we train a model to predict the price of a listing. To do so, we need to prepare the data further:

1) Drop all features that don't contain useful data for our model, such as URLs, dates and coordinates.
2) Drop all entries with missing values for the response variable, in this case the price.
3) Fill missing values in numerical features with the column mean.
4) Encode categorical features as dummy variables.
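The steps above can be sketched on a small toy frame. This is only an illustration under stated assumptions: the `toy` DataFrame, its column names and its values are hypothetical and are not taken from the listings file.

```python
import pandas as pd

# Hypothetical toy data standing in for the listings (names and values are illustrative).
toy = pd.DataFrame({
    "price": [100, 150, None, 80],
    "bedrooms": [1.0, None, 2.0, 3.0],
    "room_type": ["Entire home/apt", "Private room", "Private room", None],
})

# Step 2: drop rows where the response variable (price) is missing.
toy = toy.dropna(subset=["price"]).copy()

# Step 3: fill missing numerical feature values with the column mean.
num_cols = toy.select_dtypes(include="number").columns.drop("price")
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].mean())

# Step 4: replace every non-numerical column with dummy variables.
cat_cols = list(toy.select_dtypes(exclude="number").columns)
toy_prepared = pd.get_dummies(toy, columns=cat_cols)

print(toy_prepared)
```

With the default `dummy_na=False`, a row whose category is missing simply gets no active dummy column; whether that is appropriate for the real data is a modelling choice.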

In [55]:
# Drop all features that don't contain useful data for our model, such as URLs, dates and coordinates.
df = df_detailed_listings.drop(columns=['listing_url', 'scrape_id', 'last_scraped', 'source', 'picture_url', 'host_url',
    'host_thumbnail_url', 'host_picture_url', 'latitude', 'longitude', 'calendar_updated', 'calendar_last_scraped', 'license']).copy()

In [56]:
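# Drop all entries with missing values for the response variable, in this case the price.
# This has to happen before the integer conversion, because NaN cannot be cast to int.
df = df.dropna(subset=['price']).copy()

# Strip the leading character (the currency symbol) and the thousands separators,
# then keep the whole-number part as a usable integer datatype:
df['price'] = df['price'].str[1:].str.replace(',', '').str.split('.').str[0].astype(int)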

In [65]:
print(f'Numerical features in the dataset: \n{list(df.select_dtypes(["int","float"]).columns)}')
print(f'Categorical features in the dataset: \n{list(df.select_dtypes("object").columns)}')

Numerical features in the dataset: 
['id', 'host_id', 'host_listings_count', 'host_total_listings_count', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']
Categorical features in the dataset: 
['name', 'description', 'neighborhood_overview', 'host_name', 'host_since', 

After preparing the data, we split the dataset into the feature matrix X and the response vector y, and then further into subsets used for training and testing the model.
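A minimal sketch of this split, followed by a linear regression fit, could look like the following. The synthetic `demo` frame, the 70/30 split (`test_size=0.3`) and `random_state=42` are assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Hypothetical, synthetic stand-in for the prepared listings data.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "accommodates": rng.integers(1, 8, size=200),
    "bedrooms": rng.integers(1, 5, size=200),
})
demo["price"] = 30 * demo["accommodates"] + 20 * demo["bedrooms"] + rng.normal(0, 5, size=200)

# Separate the feature matrix X from the response vector y.
X = demo.drop(columns="price")
y = demo["price"]

# Hold out part of the data for testing (split ratio and seed are assumed here).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit on the training set only and evaluate on the held-out test set.
model = LinearRegression()
model.fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"R^2 on the test set: {test_r2:.3f}")
```

Fixing `random_state` keeps the split reproducible between runs, which makes test scores comparable while the feature preparation is still changing.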