# Some fancy Title

## Goal of the project

-) Is there a connection between the location of an AirBnb, the points of interest around it, and its location rating?
-) How well can a new AirBnb location be predicted?

## Data sets

### listings.csv

Source: http://data.insideairbnb.com/austria/vienna/vienna/2022-09-11/data/listings.csv.gz

This data set contains information about listed AirBnb apartments/rooms in Vienna on the 12th of October 2022. This data set was chosen as it contains location data as well as review ratings for each listed apartment/room.

TODO: Describe data set in more detail

### LANDESGRENZEOGD.csv

Source: https://data.wien.gv.at/daten/geo?service=WFS&request=GetFeature&version=1.1.0&typeName=ogdwien:LANDESGRENZEOGD&srsName=EPSG:4326&outputFormat=csv

This data set contains geolocation data about the borders of the Austrian federal states. It was chosen to get the exact borders of Vienna, which was especially useful for plotting.

### HALTESTELLEWLOGD.csv
Source: https://data.wien.gv.at/daten/geo?service=WFS&request=GetFeature&version=1.1.0&typeName=ogdwien:HALTESTELLEWLOGD&srsName=EPSG:4326&outputFormat=csv

This data set contains information about each public transport station. It was chosen because it contains the geolocation of each station.
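To make the use of station geolocations concrete, the following is a minimal sketch of how the distance from a listing to its nearest station could be computed. The notebook imports `geopy.distance` for this kind of task; the haversine formula here is a dependency-free stand-in, not the project's method. The helper names (`haversine_km`, `nearest_station`) and the coordinates are hypothetical and only approximate.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearest_station(listing, stations):
    """Return (name, distance_km) of the station closest to the listing."""
    candidates = (
        (name, haversine_km(listing[0], listing[1], lat, lon))
        for name, (lat, lon) in stations.items()
    )
    return min(candidates, key=lambda pair: pair[1])


# Hypothetical, approximate coordinates for illustration only
stations = {"Stephansplatz": (48.2085, 16.3731), "Karlsplatz": (48.2003, 16.3697)}
listing = (48.2100, 16.3700)
name, dist_km = nearest_station(listing, stations)
```

With the real data, the station dictionary would be built from the coordinates in HALTESTELLEWLOGD.csv and the listing tuple from the `latitude` and `longitude` columns of listings.csv.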

### WIENTOURISMUSOGD.csv
Source: https://data.wien.gv.at/daten/geo?service=WFS&request=GetFeature&version=1.1.0&srsName=EPSG:4326&outputFormat=csv&typeName=ogdwien:WIENTOURISMUSOGD

This data set contains information about points of interest for tourists. It was chosen because it offers a wide variety of location types that tourists might be interested in.

TODO: Describe data set in more detail

### district_to_post.csv

Source: Handmade by authors of this project

This data set is a mapping file that converts between the postal code and the name of each Viennese district. It was created to join listings.csv with the other data sets.
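The join through the mapping file could look roughly like the sketch below. The column names `district` and `postal_code` are assumptions, since the layout of district_to_post.csv is not shown here; `neighbourhood_cleansed` is the district column of listings.csv. The rows are small made-up samples.

```python
import pandas as pd

# Hypothetical mapping rows; the real district_to_post.csv may use other column names
district_to_post = pd.DataFrame(
    {"district": ["Innere Stadt", "Leopoldstadt"], "postal_code": [1010, 1020]}
)

# Made-up listings that only carry the district name
listings_sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "neighbourhood_cleansed": ["Leopoldstadt", "Innere Stadt", "Leopoldstadt"],
    }
)

# A left join keeps every listing; validate guards against duplicate districts in the mapping
joined = listings_sample.merge(
    district_to_post,
    left_on="neighbourhood_cleansed",
    right_on="district",
    how="left",
    validate="many_to_one",
)
```

A left join is used so that listings without a matching district stay visible as missing values instead of silently disappearing.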

### rent_buy.csv
Source: https://www.immopreise.at/Wien/Wohnung/Miete (scraped by hand on 16/12/2022)

This data set contains information about the average rent per square metre and the average purchase price per square metre of apartments in Vienna. It was chosen to provide information about the expenses of an AirBnb apartment.

### model_data.csv

Source: Created by this Jupyter notebook

This data set was created out of the ones above and provides the data for the further analysis.

## Imports
The following libraries are used in this project. We also turned off the pandas `chained_assignment` warning, as it would otherwise be raised throughout the notebook; we avoid the underlying problem by reassigning the data sets instead of copying them.

In [2]:
import pandas as pd
import geopy.distance
import numpy as np
from shapely.geometry import Point
from shapely.geometry.polygon import Polygon
import shapely.wkt
import geopandas as gpd
import matplotlib.pyplot as plt
from shapely.prepared import prep
from sklearn.linear_model import LinearRegression
from scipy import stats
from statsmodels.api import OLS
pd.options.mode.chained_assignment = None
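For comparison, a common alternative to silencing the warning is to take an explicit copy of a selection before modifying it. The sketch below uses a small made-up frame; the names `frame` and `subset` are hypothetical, and this is not how the notebook itself proceeds.

```python
import pandas as pd

# Made-up data for illustration
frame = pd.DataFrame({"price": [50, 120, 80], "accommodates": [2, 4, 0]})

# .copy() makes the selection an independent frame, so adding a column is unambiguous
subset = frame[frame["accommodates"] > 0].copy()
subset["price_per_person"] = subset["price"] / subset["accommodates"]
```

The original `frame` is left untouched, and only `subset` carries the new column.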

## Data cleaning

First we load the data set, then we remove the columns we do not need for our analysis. As shown below, the data set contains multiple columns with information about the host, which is not needed for either of the target questions. The information about the source, the time of scraping, and the URL of each listing is dropped as well. The name and the description of the individual listing are not needed either, as this task focuses mainly on price and location. Therefore, we only keep the columns that are related to price (`price`, plus `accommodates` and `bedrooms`, which are needed to calculate the price per person) or to location (`neighbourhood_cleansed` and the geolocation). The other columns that are kept either contain critical information to identify the listing (`id`) or provide additional information that might influence the price or the review of a location without depending on the host.

In [24]:
listings = pd.read_csv("data/listings.csv")
listings.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [25]:
listings_cleaned = listings[["id", "neighbourhood_cleansed", "latitude", "longitude", "property_type",
                             "room_type", "accommodates", "bedrooms", "price","number_of_reviews",
                             "review_scores_rating", "review_scores_location", "reviews_per_month"]]

In the next step we strip the $ sign and the thousands separators from the `price` column so that it can be cast to numeric. Furthermore, we assign the correct data types to the columns.

In [31]:
# Remove "$" and "," so that a value such as "$1,200.00" parses as 1200.0
listings_cleaned['price'] = listings_cleaned['price'].str.replace(r'[$,]', '', regex=True)
listings_cleaned['price'] = pd.to_numeric(listings_cleaned['price'])
# TODO Cast the other columns as well, continue the cleaning
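The format of the price strings matters for how they are parsed. The sketch below compares two approaches on made-up values, assuming the strings look like `"$1,200.00"` with a thousands separator; the variable names are hypothetical.

```python
import pandas as pd

# Made-up price strings in the assumed "$1,200.00" format
raw_prices = pd.Series(["$85.00", "$1,200.00", "$40.00"])

# Extracting the first run of digits stops at the thousands separator
first_digits = pd.to_numeric(raw_prices.str.extract(r"(\d+)", expand=False))

# Removing "$" and "," first keeps the full amount, including cents
stripped = pd.to_numeric(raw_prices.str.replace(r"[$,]", "", regex=True))
```

Here `first_digits` reads the second value as 1, while `stripped` reads it as 1200.0, so prices of 1,000 and above are the ones to check after cleaning.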

In [None]:
listings = pd.read_csv("data/listings.csv")
listings_cleaned = listings[["id", "name", "neighbourhood_cleansed", "latitude", "longitude", "property_type",
                             "room_type", "accommodates", "bedrooms", "price","number_of_reviews",
                             "review_scores_rating", "review_scores_location", "reviews_per_month"]]
# Drop listings with missing values in any of the kept columns
listings_cleaned = listings_cleaned.dropna()
# Strip "$" and "," before casting; raw values such as "$85.00" cannot be cast directly
listings_cleaned['price'] = pd.to_numeric(listings_cleaned['price'].str.replace(r'[$,]', '', regex=True))
listings_cleaned['accommodates'] = pd.to_numeric(listings_cleaned['accommodates'])
listings_cleaned['price_per_person'] = listings_cleaned['price'] / listings_cleaned['accommodates']