# Berlin AirBnB Dataset

## Tasks to be performed
### 1. Thorough Analysis of the Data
### 2. Making a Prediction Model from the Appropriate Features

# Data Analysis

## Imports

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder
pd.options.display.max_columns = None
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split


## Reading The Data and basic display


In [None]:
data = pd.read_csv("/kaggle/input/berlin-airbnb-data/listings.csv")

In [None]:
data

In [None]:
data.describe()

We can see that there are some irregularities in the data. Some points to note:
1. Most features have the same number of rows except for reviews_per_month, which has missing values that need to be handled.
2. For the price feature, the mean of 67 is acceptable, but there are extremes on both ends: the maximum price is 9000 and the minimum is 0. These extremes inflate the variance and std of the price data, so it requires further observation.
3. Similar unacceptable extremes appear in minimum_nights, where the maximum is 5000 nights, which is practically impossible.
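
The points above can be checked numerically rather than by reading `describe()` alone. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins, not the real listings): it counts missing values per column and prints a few price quantiles to put the minimum and maximum into context.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the listings data.
toy = pd.DataFrame({
    "price": [0, 45, 60, 80, 9000],
    "minimum_nights": [1, 2, 3, 2, 5000],
    "reviews_per_month": [0.5, np.nan, 1.2, np.nan, 3.0],
})

# Number of missing values in each column.
missing_counts = toy.isnull().sum()

# A few quantiles show how far the extremes sit from the bulk of the data.
price_quantiles = toy["price"].quantile([0.25, 0.5, 0.75, 0.99])

print(missing_counts)
print(price_quantiles)
```

On the real data the same two calls would be applied to `data` instead of `toy`.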

Let's explore the data a bit more

### Correlations in the raw data (to be checked again after removing extreme values)

In [None]:
# numeric_only=True (pandas >= 1.5) skips text columns such as name and room_type
data.corr(numeric_only=True)

In [None]:
plt.figure(figsize=(30,10))
# numeric_only=True (pandas >= 1.5) skips text columns such as name and room_type
sns.heatmap(data=data.corr(numeric_only=True), annot=True)

No strong correlation can be observed in the data. However, we will look into each of the features and look for patterns. After that we will process the data a bit and check for correlations again.

### Cleaning The Data

We will clean the data for extreme values of price and minimum_nights. The mean price is 67 and the std is 200, so prices far above that range can be treated as extremes; the cells below remove listings priced above 600. Listings with a price of 0 are also removed, as we do not consider any listing to be free.
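
The cleaning rule can be sketched as one reusable filter. `drop_price_extremes` is a hypothetical helper, and its default `upper=600` is taken from the cutoff used in the cleaning cells of this notebook. It keeps prices strictly above 0, which matches dropping `price == 0` as long as there are no negative prices.

```python
import pandas as pd

def drop_price_extremes(df, upper=600):
    """Keep rows with 0 < price <= upper (hypothetical helper, assumed cutoff)."""
    mask = (df["price"] > 0) & (df["price"] <= upper)
    # .copy() avoids chained-assignment warnings when columns are added later.
    return df.loc[mask].copy()

# Made-up prices for illustration only.
toy = pd.DataFrame({"price": [0, 35, 67, 600, 601, 9000]})
cleaned = drop_price_extremes(toy)
print(len(toy) - len(cleaned), "rows removed")
```

Returning a copy is a design choice: later cells add helper columns, and doing that on a filtered view can raise `SettingWithCopyWarning`.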

### Cleaning the Price Data

In [None]:
# KDE plot 
plt.figure(figsize=(10,5))
sns.kdeplot(data=data.price, fill=True)  # fill replaces the deprecated shade argument

In [None]:
# Number of listings that fall outside the accepted price range
((data.price == 0) | (data.price > 600)).sum()

In [None]:
data = data[data.price != 0 ]
data = data[data.price <=600 ]

In [None]:
# KDE plot 
plt.figure(figsize=(10,5))
sns.kdeplot(data=data.price, fill=True)  # fill replaces the deprecated shade argument

### Cleaning the Minimum Nights Data



In [None]:
# KDE plot 
plt.figure(figsize=(20,10))
sns.kdeplot(data=data.minimum_nights, fill=True)  # fill replaces the deprecated shade argument

In [None]:
sum(data.minimum_nights>90)

In [None]:
data = data[data.minimum_nights <=90 ]

In [None]:
# KDE plot 
plt.figure(figsize=(20,10))
sns.kdeplot(data=data.minimum_nights, fill=True)  # fill replaces the deprecated shade argument

## Analysis of ID, Name, Host ID and Hostname

Our main focus is on id and host_id, as we assume that the names map to one of the ids or host ids. There are fewer unique host ids than listing ids, which means some hosts have multiple listings (rooms/apartments).
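
To see which hosts own more than one listing, the listings can be counted per host. This is a small sketch on made-up ids (`toy` is hypothetical); only the column names `id` and `host_id` come from the dataset.

```python
import pandas as pd

# Hypothetical ids: host 10 has two listings, host 12 has three.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "host_id": [10, 10, 11, 12, 12, 12],
})

# Count distinct listings for every host.
listings_per_host = toy.groupby("host_id")["id"].nunique()

# Keep only the hosts with more than one listing.
multi_listing_hosts = listings_per_host[listings_per_host > 1]
print(multi_listing_hosts.sort_values(ascending=False))
```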

In [None]:
data.host_id.isnull().sum()

In [None]:
print("Unique ID : ", len(data.id.unique()))
print("Unique host ID : ", len(data.host_id.unique()))

## Geo-Spatial Visualization of the Data

We will use the longitude and latitude to plot the data geospatially. Some of the upcoming features will also be plotted geospatially to see if patterns can be identified.

In [None]:
plt.figure(figsize=(20,10))

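# Note: latitude is on the x-axis here, so north points to the right of the plot.
# s sets a fixed marker size; color expects a single colour, so take the first palette entry.
sns.scatterplot(x=data['latitude'], y=data['longitude'], s=15, color=sns.color_palette('winter', n_colors=1)[0])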
plt.show()

## Analysis of Neighbourhood Groups

In [None]:
data.neighbourhood_group.isnull().any()

### Geographical View of the neighbourhoods

Fantastic! We have a good representation of the neighbourhood data: we can see the locations of the different neighbourhoods and the density of listings in each. The points we can draw are:

1. There are more listings in the central neighbourhoods (this also shows in the value counts), mainly Friedrichshain-Kreuzberg, Mitte, Pankow and Neukölln. These places are closer to the city center and therefore have more listings.

2. After the central neighbourhoods, the neighbourhoods in the lower part of the plot, Schöneberg and Charlottenburg, have the next-highest density of listings. Because the plot puts latitude on the x-axis, the lower part of the figure corresponds to smaller longitudes, that is the western side of the city, rather than the south. This suggests that residential facilities (housing, communication, markets) are more available in that part of the city.

### We also have the smaller neighbourhood feature. We will look into it later; for now we analyse the larger neighbourhood groups and their relationships with other features

In [None]:
plt.figure(figsize=(25,15))
sns.set_style('white')
customPalette = ['#800000', '#e6194B', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#000000', '#000075', '#444444', '#008080', '#ec0101']
sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=data['latitude'], y=data['longitude'], hue=data["neighbourhood_group"])
plt.show()

In [None]:
plt.figure(figsize=(30,10))
sns.barplot(x=data.neighbourhood_group.value_counts().index, y=data.neighbourhood_group.value_counts(),palette=sns.color_palette('magma', n_colors=12))
plt.show()

In [None]:
data.room_type.value_counts()

It is difficult to observe the relation between room types and neighbourhoods because the plot is very dense, but Private room and Entire home/apt listings seem to be evenly distributed, as their numbers are roughly the same.

In [None]:
plt.figure(figsize=(25,15))
markers = {"Private room": "s", "Entire home/apt": "X", "Shared room":"o"}
customPalette = ['#800000', '#e6194B', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#000000', '#000075', '#444444', '#008080', '#ec0101']
sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=data['latitude'], y=data['longitude'], hue=data["neighbourhood_group"], style=data['room_type'], markers=markers)
plt.show()

### Now we want to see the relationship between price and the neighbourhood

In [None]:
data.price.min()

In [None]:
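# temp is an alias of data, not a copy: the helper column is also added to data
# and is dropped again once the neighbourhood group analysis is done.
temp = data
temp["price_75"] = data.price > data.price.quantile(0.75)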


As it is hard to understand the pattern between neighbourhood and price from the geospatial view, we will try a different approach.

In [None]:
plt.figure(figsize=(25,15))
markers = {True: "o", False: "X"}

sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=temp['latitude'], y=temp['longitude'], hue=temp["neighbourhood_group"], style=temp['price_75'], markers=markers)
plt.show()

#### Overall comparison of price and neighbourhood

In [None]:
plt.figure(figsize=(40,10))
sns.barplot(x=data['neighbourhood_group'], y=data['price'], palette=sns.color_palette('magma', n_colors=12))

From the overall view, not much can be observed; the price seems to be almost the same in all neighbourhoods. Two neighbourhoods, Charlottenburg and Schöneberg, seem to have relatively higher prices. We may see better patterns if we look at the count of listings relative to the min, max or third quartile. We cannot compare against the min and max, as they are extreme values, but the third quartile should give good insight.

#### Comparing counts above the third quartile of price by Neighbourhood

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile Price Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=temp.neighbourhood_group, hue=temp["price_75"], palette=sns.color_palette('magma', n_colors=2))
plt.show()

Great! The third quartile gives us quite a few ideas, and it supports the hypothesis that prices are higher near the city center.

The neighbourhoods around the city center have a higher count of listings costing more than the third quartile. This may imply that listings near the city center are more expensive. We can plot this geospatially and see.
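
Raw counts favour neighbourhoods that simply have more listings, so a share per neighbourhood can be a useful complement to the count plot. A minimal sketch, assuming a made-up frame `toy` with hypothetical prices: the mean of a boolean column within each group is the fraction of listings above the third quartile.

```python
import pandas as pd

# Hypothetical prices for two neighbourhood groups of different size.
toy = pd.DataFrame({
    "neighbourhood_group": ["Mitte"] * 4 + ["Spandau"] * 2,
    "price": [50, 120, 150, 60, 40, 130],
})

# Third quartile over all listings, as in the price_75 column.
threshold = toy["price"].quantile(0.75)
above = toy["price"] > threshold

# Mean of a boolean per group = share of listings above the threshold.
share_above = above.groupby(toy["neighbourhood_group"]).mean()
print(share_above)
```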

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=data['latitude'], y=data['longitude'],hue=temp.price_75, palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
sns.swarmplot(x=data['neighbourhood_group'],
              y=data['price'])

### Comparison of Minimum Nights and Neighbourhoods

As this will be hard to see from geospatial plotting we will use bar charts to display the data.

In [None]:
plt.figure(figsize=(40,10))
sns.barplot(x=data['neighbourhood_group'], y=data['minimum_nights'], palette=sns.color_palette('magma', n_colors=12))

We can see that the neighbourhood Spandau has, on average, a higher number of minimum nights of stay; one reason may be that it is far from the city center. It also has fewer listings, so we see a large standard error in the data.

That is pretty much all that can be observed from comparing the neighbourhood groups and minimum nights.

### Observation between Reviews, Reviews per Month and Neighbourhood

In [None]:
plt.figure(figsize=(40,10))
sns.barplot(x=data['neighbourhood_group'], y=data['number_of_reviews'], palette=sns.color_palette('magma', n_colors=12))

In [None]:
plt.figure(figsize=(40,10))
sns.barplot(x=data['neighbourhood_group'], y=data['reviews_per_month'], palette=sns.color_palette('magma', n_colors=12))

Nothing significant can be observed from the review data. We cannot tell a good listing from a bad one using only the number of reviews or the review rate. However, this data may suggest which neighbourhoods get more visitors, as more reviews may mean more visitors. We can look at the listings whose review rate exceeds the third quartile to see whether any pattern shows which neighbourhoods are busier.


In [None]:
data["TQmorereviews"] = data["reviews_per_month"]>data["reviews_per_month"].quantile(0.75)

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=data['latitude'], y=data['longitude'],hue=data.TQmorereviews, palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile Monthly Reviews Rate Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=data.neighbourhood_group, hue=data["TQmorereviews"], palette=sns.color_palette('magma', n_colors=2))
plt.show()

We can see from the geospatial map that, relative to the listing density, places with a higher review rate are spread all over the map. That means every neighbourhood has places that receive a higher number of reviews; they are not concentrated in the center. The bar comparison shows that some neighbourhoods have higher counts of frequently reviewed listings, but this is due to the fact that the density of listings in those regions is lower. If we look at the neighbourhoods far from the city center, the ratio of False to True gets lower.
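
The ratio of False to True mentioned above can be read off a row-normalised cross table. This is a sketch on hypothetical values (`toy` and the labels `"above Q3"` / `"at or below Q3"` are made up for illustration); `pd.crosstab` with `normalize="index"` turns each neighbourhood's counts into shares.

```python
import pandas as pd

# Hypothetical review rates for two neighbourhood groups.
toy = pd.DataFrame({
    "neighbourhood_group": ["Mitte", "Mitte", "Mitte", "Mitte", "Spandau", "Spandau"],
    "reviews_per_month": [0.2, 0.4, 3.0, 0.3, 2.5, 0.1],
})

# Flag listings above the third quartile and give the flag readable labels.
is_high = toy["reviews_per_month"] > toy["reviews_per_month"].quantile(0.75)
review_level = is_high.map({True: "above Q3", False: "at or below Q3"})

# Each row sums to 1, so neighbourhoods of different size become comparable.
shares = pd.crosstab(toy["neighbourhood_group"], review_level, normalize="index")
print(shares)
```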

### Occupied Time of Listings Depending on the Neighbourhood - Relation of host listings count, availability and neighbourhood

In [None]:
temp = data
temp["TQcalculated_host_listings_count"] = temp["calculated_host_listings_count"]>temp["calculated_host_listings_count"].quantile(0.75)

In [None]:
plt.figure(figsize=(40,10))
sns.barplot(x=data['neighbourhood_group'], y=data['calculated_host_listings_count'], palette=sns.color_palette('magma', n_colors=12))

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile Host listing Count Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=temp.neighbourhood_group, hue=temp["TQcalculated_host_listings_count"], palette=sns.color_palette('magma', n_colors=2))
plt.show()

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=temp['latitude'], y=temp['longitude'],hue=temp.TQcalculated_host_listings_count, palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

We can see that the ratio of busy to not-busy listings (as flagged here by calculated_host_listings_count) is higher in the neighbourhoods far from the center. This may be because there are fewer options as you move away from the city center, so visitors have to pick the best local option. As a result, the locally best listings get more visitors.

In [None]:
plt.figure(figsize=(40,10))
sns.barplot(x=data['neighbourhood_group'], y=data['availability_365'], palette=sns.color_palette('magma', n_colors=12))

In [None]:
temp = data
# Third quartile (0.75) flags the less busy listings
temp["TQavailability_365"] = temp["availability_365"]>temp["availability_365"].quantile(0.75)
# quantile(0.50) is the median of availability_365, not the mean
temp["FQavailability_365"] = temp["availability_365"]>temp["availability_365"].quantile(0.50)


In [None]:
plt.figure(figsize=(25,15))
plt.title("Third quartile availability_365 Listings (Less busy ones)")
# Colour by the availability flag named in the title (the original cell reused the host listing count flag)
sns.scatterplot(x=temp['latitude'], y=temp['longitude'],hue=temp.TQavailability_365, palette=sns.color_palette('magma', n_colors=2), alpha=0.5)
plt.show()


In [None]:

plt.figure(figsize=(25,15))
plt.title("Median Comparison availability_365 Listings (False = more busy ones)")
sns.scatterplot(x=temp['latitude'], y=temp['longitude'],hue=temp.FQavailability_365, palette=sns.color_palette('magma', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Median availability_365 Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=temp.neighbourhood_group, hue=temp["FQavailability_365"], palette=sns.color_palette('magma', n_colors=2))
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile availability_365 Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=temp.neighbourhood_group, hue=temp["TQavailability_365"], palette=sns.color_palette('magma', n_colors=2))
plt.show()

It seems that the listings near the city center are more available, which means they have fewer guests throughout the year. In the most central neighbourhood, Friedrichshain-Kreuzberg, there are more free listings than busy ones. This may be because there is more competition in the city center, including hotels, so these listings get fewer guests and are free more often.

It is hard to see from the third quartile mapping, but in the median mapping (quantile 0.50) very few listings on the outskirts of the city have availability above the median, which means they are more occupied with guests.
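
Note that the `FQavailability_365` flag is built with `quantile(0.50)`, which is the median rather than the mean. For skewed data the two can differ a lot, as this small sketch with made-up availability values shows.

```python
import pandas as pd

# Hypothetical, strongly skewed availability values.
availability = pd.Series([0, 0, 0, 10, 365], name="availability_365")

median_value = availability.quantile(0.50)  # identical to availability.median()
mean_value = availability.mean()

# Share of listings above each reference value.
share_above_median = (availability > median_value).mean()
share_above_mean = (availability > mean_value).mean()

print(median_value, mean_value, share_above_median, share_above_mean)
```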

In [None]:
data.head()

In [None]:
data = data.drop(["price_75","TQmorereviews","TQcalculated_host_listings_count","TQavailability_365","FQavailability_365"], axis=1)

## We have analysed the data for the larger neighbourhood groups. Now we can look at the smaller neighbourhoods and see if there are any patterns with the prices.


We will take 4 of the larger neighbourhood groups and analyse the price distribution within each of them.

In [None]:
data.neighbourhood_group.value_counts()

### Friedrichshain-Kreuzberg Data

In [None]:
data.shape

In [None]:
Kreu_data = data[data.neighbourhood_group == "Friedrichshain-Kreuzberg"]

In [None]:
Kreu_data.shape

In [None]:
plt.figure(figsize=(25,15))
sns.set_style('white')
customPalette = ['#800000', '#e6194B', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#000000', '#000075', '#444444', '#008080', '#ec0101']
sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=Kreu_data['latitude'], y=Kreu_data['longitude'], hue=Kreu_data["neighbourhood"])
plt.show()

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=Kreu_data['latitude'], y=Kreu_data['longitude'],hue=Kreu_data.price>Kreu_data.price.quantile(0.75), palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Median Price Comparison",fontsize=20)
# Pass the column as x=; this cell splits at quantile(0.5), the median price
sns.countplot(x=Kreu_data.neighbourhood, hue=Kreu_data.price>Kreu_data.price.quantile(0.5), palette=sns.color_palette('magma', n_colors=2))
plt.show()

We can see that for Friedrichshain-Kreuzberg the listing prices are well distributed; no region specifically has more expensive listings.

### Mitte Data

In [None]:
mitte_data = data[data.neighbourhood_group == "Mitte"]

In [None]:
mitte_data.shape

In [None]:
plt.figure(figsize=(25,15))
sns.set_style('white')
customPalette = ['#800000', '#e6194B', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#000000', '#000075', '#444444', '#008080', '#ec0101']
sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=mitte_data['latitude'], y=mitte_data['longitude'], hue=mitte_data["neighbourhood"])
plt.show()

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=mitte_data['latitude'], y=mitte_data['longitude'],hue=mitte_data.price>mitte_data.price.quantile(0.75), palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile Price Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=mitte_data.neighbourhood, hue=mitte_data.price>mitte_data.price.quantile(0.75), palette=sns.color_palette('magma', n_colors=2))
plt.show()

In [None]:
# plt.figure(figsize=(35,10))
# sns.swarmplot(x=mitte_data['neighbourhood'],
#               y=mitte_data['price'])

We can see something interesting in the Mitte data. The expensive places are concentrated on the left side of the plot, meaning closer to the city center. So for this neighbourhood group, the smaller neighbourhoods do have an effect on the price.

### Steglitz - Zehlendorf Data

In [None]:
#Steglitz - Zehlendorf
Zehl_data = data[data.neighbourhood_group == "Steglitz - Zehlendorf"]

In [None]:
plt.figure(figsize=(25,15))
sns.set_style('white')
customPalette = ['#800000', '#e6194B', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#000000', '#000075', '#444444', '#008080', '#ec0101']
sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=Zehl_data['latitude'], y=Zehl_data['longitude'], hue=Zehl_data["neighbourhood"])
plt.show()

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=Zehl_data['latitude'], y=Zehl_data['longitude'],hue=Zehl_data.price>Zehl_data.price.quantile(0.75), palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile Price Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=Zehl_data.neighbourhood, hue=Zehl_data.price>Zehl_data.price.quantile(0.75), palette=sns.color_palette('magma', n_colors=2))
plt.show()

This neighbourhood group does not have a high density of listings, so it is hard to see any pattern in the distribution of listing prices.

### Charlottenburg-Wilm. Data

In [None]:
#Charlottenburg-Wilm.
Char_data = data[data.neighbourhood_group == "Charlottenburg-Wilm."]

In [None]:
plt.figure(figsize=(25,15))
sns.set_style('white')
customPalette = ['#800000', '#e6194B', '#f58231', '#ffe119', '#3cb44b', '#42d4f4', '#911eb4', '#000000', '#000075', '#444444', '#008080', '#ec0101']
sns.set_palette(customPalette)  # set_palette returns None, so call it on its own line
sns.scatterplot(x=Char_data['latitude'], y=Char_data['longitude'], hue=Char_data["neighbourhood"])
plt.show()

In [None]:
plt.figure(figsize=(25,15))

sns.scatterplot(x=Char_data['latitude'], y=Char_data['longitude'],hue=Char_data.price>Char_data.price.quantile(0.75), palette=sns.color_palette('prism', n_colors=2), alpha=0.5)
plt.show()

In [None]:
plt.figure(figsize=(35,10))
plt.title("Third quartile Price Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=Char_data.neighbourhood, hue=Char_data.price>Char_data.price.quantile(0.75), palette=sns.color_palette('magma', n_colors=2))
plt.show()

In [None]:
plt.figure(figsize=(35,10))
sns.swarmplot(x=Char_data['neighbourhood'],
              y=Char_data['price'])

We can see that some neighbourhoods do have a higher number of listings priced above the third quartile. We can conclude that the smaller neighbourhood areas also have an effect on the price. Therefore we will consider this feature when training the model.
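
Since neighbourhood is a text column, it has to be encoded before a model can use it. One possible approach, shown here only as a sketch and not necessarily the encoding used later in this notebook, is one-hot encoding with `pd.get_dummies`. The frame `toy`, its neighbourhood labels and the `nb` prefix are hypothetical.

```python
import pandas as pd

# Hypothetical neighbourhood labels and prices.
toy = pd.DataFrame({
    "neighbourhood": ["nbhd_a", "nbhd_b", "nbhd_a"],
    "price": [60, 45, 80],
})

# One indicator column per neighbourhood; the original text column is dropped.
encoded = pd.get_dummies(toy, columns=["neighbourhood"], prefix="nb")
print(encoded)
```

With many distinct neighbourhoods this produces a wide, sparse table, which is the usual trade-off of one-hot encoding.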

We have seen that the geolocation has a certain effect on the price. Now we will examine room types and prices.

## Room Type and Price

In [None]:
plt.figure(figsize=(10,10))
sns.barplot(x=data['room_type'], y=data['price'], palette=sns.color_palette('magma', n_colors=12))

In [None]:
plt.figure(figsize=(10,10))
plt.title("Third quartile Price Comparison",fontsize=20)
# Pass the column as x=; newer seaborn versions treat the first positional argument as data
sns.countplot(x=data.room_type, hue=data.price>data.price.quantile(0.75), palette=sns.color_palette('magma', n_colors=2))
plt.show()

As expected we can see that the price of entire home/apt is higher than a private room. Therefore this will be a crucial feature for determining the price.
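
The difference between room types can also be summarised as a small table. This sketch uses made-up prices (`toy` is hypothetical); the room type labels are the ones that appear in the dataset.

```python
import pandas as pd

# Hypothetical prices per room type.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt"] * 3 + ["Private room"] * 3 + ["Shared room"],
    "price": [80, 100, 120, 30, 40, 50, 20],
})

# Median is less sensitive to expensive outliers than the mean.
summary = toy.groupby("room_type")["price"].agg(["median", "mean", "count"])
print(summary)
```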

## Observation of Minimum Nights and Price

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('white')

# Use the full cleaned data here (the original cell plotted only mitte_data);
# palette is dropped because it has no effect without a hue variable.
sns.scatterplot(x=data['minimum_nights'], y=data['price'])
plt.show()

It may look as if listings with a lower minimum number of nights are more expensive, but this is not the case. Most rooms have a low minimum number of nights, which is why the expensive rooms also appear in that range.
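
One way to check this is to bin minimum_nights and compare the count and median price per bin. The sketch below uses made-up values, and the bin edges and labels are arbitrary assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical listings: most have a short minimum stay.
toy = pd.DataFrame({
    "minimum_nights": [1, 1, 2, 2, 3, 7, 30, 60],
    "price": [50, 200, 60, 300, 70, 65, 55, 45],
})

# Assumed bin edges; intervals are right-closed, e.g. (0, 3] is labelled "1-3".
bins = [0, 3, 7, 30, 90]
labels = ["1-3", "4-7", "8-30", "31-90"]
toy["nights_bin"] = pd.cut(toy["minimum_nights"], bins=bins, labels=labels)

# The count shows where the data sits; the median shows the typical price there.
by_bin = toy.groupby("nights_bin", observed=True)["price"].agg(["count", "median", "max"])
print(by_bin)
```

If the short-stay bin holds most of the listings, it will also hold most of the expensive ones even when the median price is similar across bins.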

## Reviews and Prices

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('white')

sns.scatterplot(x=data['number_of_reviews'], y=data['price'], palette=sns.color_palette('plasma', n_colors=2))
plt.show()

In [None]:
sns.lmplot(x="number_of_reviews", y="price",  data=data)

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('white')

sns.scatterplot(x=data['reviews_per_month'], y=data['price'], palette=sns.color_palette('plasma', n_colors=2))
plt.show()

In [None]:
sns.lmplot(x="reviews_per_month", y="price",  data=data)

It doesn't seem like there is any strong relationship between reviews and prices.
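
The visual impression can be backed by a correlation coefficient. A minimal sketch with hypothetical numbers: `Series.corr` computes the Pearson coefficient by default, where values near 0 indicate no linear relationship and values near 1 or -1 a strong one.

```python
import pandas as pd

# Hypothetical review counts and prices.
toy = pd.DataFrame({
    "number_of_reviews": [0, 5, 10, 50, 100],
    "price": [60, 40, 80, 50, 70],
})

# Pearson correlation between reviews and price.
r_reviews_price = toy["number_of_reviews"].corr(toy["price"])

# Sanity check: a column is perfectly correlated with a scaled copy of itself.
r_perfect = toy["price"].corr(toy["price"] * 2)

print(r_reviews_price, r_perfect)
```

Pearson only captures linear relationships, so a low value does not rule out other kinds of dependence.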

## Listings Occupied Time and Price

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('white')

sns.scatterplot(x=data['calculated_host_listings_count'], y=data['price'], palette=sns.color_palette('plasma', n_colors=2))
plt.show()

In [None]:
plt.figure(figsize=(20,10))
sns.set_style('white')

sns.scatterplot(x=data['availability_365'], y=data['price'], palette=sns.color_palette('plasma', n_colors=2))
plt.show()

In [None]:
data.head()