## Dublin Airbnb - Data analysis project

Data: 
•	CSV file: ‘listings.csv’. A CSV file web-scraped from the Airbnb website, downloaded from ‘https://insideairbnb.com/get-the-data/’. The data is made up of 75 columns and 4734 rows. It contains the listing URL, the name and description of each listing, and details about the host, the accommodation and the reviews:
•	Host: name, location, date they started hosting, about, whether they are a superhost, how many listings they have.
•	Accommodation: property type, longitude, latitude, neighbourhood (~50% are blank), neighbourhood_cleansed (4 “neighbourhoods” of Dublin), number of bathrooms, bedrooms, beds, amenities, price, minimum and maximum number of nights.
•	Reviews: overall rating, cleanliness, check-in, communication, value, location.

•	GeoJSON file ‘neighbourhoods.geojson’ containing three columns, one of which, ‘neighbourhood_group’, is blank. The neighbourhood column contains the values “Dublin City, South Dublin, Fingal, Dn Laoghaire-Rathdown”, which align with the neighbourhood_cleansed column in the listings file. The last column is geometry, which contains the polygons for each of the 4 neighbourhoods.


### Load and Clean

In [1]:
import os

# Must be set before geopandas is imported, otherwise it has no effect
os.environ['USE_PYGEOS'] = '0'

import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


In [2]:
#### Load in the two files
listings = pd.read_csv('listings.csv')
dub_nb =  gpd.read_file('neighbourhoods.geojson')
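
The data description says the neighbourhood names in the two files align. A minimal sketch of how that could be checked, using small hand-made stand-ins (`listings_sample` and `areas_sample` are hypothetical names, not objects from this notebook) rather than the real files:

```python
import pandas as pd

# Hypothetical stand-ins for the two loaded files; the values come from the data description.
listings_sample = pd.DataFrame({
    "neighbourhood_cleansed": ["Dublin City", "Fingal", "South Dublin",
                               "Dn Laoghaire-Rathdown", "Dublin City"],
})
areas_sample = pd.DataFrame({
    "neighbourhood": ["Dublin City", "South Dublin", "Fingal", "Dn Laoghaire-Rathdown"],
})

listing_areas = set(listings_sample["neighbourhood_cleansed"].dropna())
polygon_areas = set(areas_sample["neighbourhood"].dropna())

# Names that appear in one file but not the other; both sets should be empty.
only_in_listings = listing_areas - polygon_areas
only_in_polygons = polygon_areas - listing_areas
print(only_in_listings, only_in_polygons)
```

On the real data the same comparison would use the corresponding columns of `listings` and `dub_nb`.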

In [3]:
### Firstly we'll look at the listings information:
listings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4734 entries, 0 to 4733
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            4734 non-null   int64  
 1   listing_url                                   4734 non-null   object 
 2   scrape_id                                     4734 non-null   int64  
 3   last_scraped                                  4734 non-null   object 
 4   source                                        4734 non-null   object 
 5   name                                          4734 non-null   object 
 6   description                                   4626 non-null   object 
 7   neighborhood_overview                         2261 non-null   object 
 8   picture_url                                   4734 non-null   object 
 9   host_id                                       4734 non-null   i

The price column has dtype object because its values contain both $ signs and commas, so we convert it to a numeric type.
The neighbourhood column contains lots of missing values but may be of use later. To make the merge between the listings and the GeoJSON data easier, we rename neighbourhood to full_neighbourhood and neighbourhood_cleansed to neighbourhood.
We also want to build a geometry column from the longitude and latitude columns.

In [4]:
# Strip the currency symbol and thousands separators, then convert to a number
listings['price'] = listings['price'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
listings['price'] = pd.to_numeric(listings['price'])
listings = listings.rename(columns={'neighbourhood': 'full_neighbourhood', 'neighbourhood_cleansed': 'neighbourhood'})

# Build point geometries from longitude/latitude, using the same CRS as the neighbourhood polygons
listings = gpd.GeoDataFrame(
    listings,
    geometry=gpd.points_from_xy(x=listings.longitude, y=listings.latitude),
    crs=dub_nb.crs,
)
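
As a quick check of the price conversion described above, here is a minimal sketch on a few made-up price strings (`raw_prices` is a hypothetical sample, not data from the file). It strips both characters in a single regex pass, which is one alternative to chaining two `str.replace` calls, and uses `errors="coerce"` so anything unparseable becomes NaN instead of raising:

```python
import pandas as pd

# Made-up price strings in the same shape as the raw column, including a missing value.
raw_prices = pd.Series(["$130.00", "$1,495.00", None, "$15.00"])

# Remove "$" and "," in one pass, then convert; unparseable values become NaN.
clean_prices = pd.to_numeric(
    raw_prices.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(clean_prices.tolist())
```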

In [5]:
listings = listings[listings['price'] != 45880.0] ## Test account
listings = listings[listings['price'] != 8820.0]  ## Test Account

##listings = listings[~listings['minimum_minimum_nights'].isin([365, 359])] ## Long term 

In [6]:
## We want a variable for the cost per person: price / accommodates
listings['price_per_person'] = listings['price'] / listings['accommodates']
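
Cost per person is the nightly price divided by the number of guests a listing accommodates. A minimal sketch on made-up values (`sample` is a hypothetical frame), which also assumes that a listing recorded with zero guests should be treated as missing rather than produce an infinite value:

```python
import numpy as np
import pandas as pd

# Made-up listings; the last row has accommodates == 0 to show the guard.
sample = pd.DataFrame({"price": [130.0, 280.0, 90.0], "accommodates": [2, 4, 0]})

# Replace zero guests with NaN so the division yields NaN instead of inf.
guests = sample["accommodates"].replace(0, np.nan)
sample["price_per_person"] = sample["price"] / guests
print(sample)
```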

##### Max, min and median price overall and for each neighbourhood

In [10]:
listings['price'].agg(['max', 'min', 'median'])

max       14246.0
min          15.0
median      130.0
Name: price, dtype: float64

In [8]:
listings.groupby('neighbourhood').agg({'price':['max', 'min', 'median']})


| neighbourhood | max price | min price | median price |
|---|---|---|---|
| Dn Laoghaire-Rathdown | 1495.0 | 22.0 | 146.0 |
| Dublin City | 2500.0 | 15.0 | 140.0 |
| Fingal | 14246.0 | 22.0 | 103.0 |
| South Dublin | 1700.0 | 25.0 | 81.5 |
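
The per-neighbourhood summary has a two-level column header because `agg` was given a dict of lists. One possible alternative, sketched here on made-up prices (`sample` and the output column names are hypothetical), is pandas named aggregation, which produces flat column names:

```python
import pandas as pd

# Made-up prices for two neighbourhoods.
sample = pd.DataFrame({
    "neighbourhood": ["Fingal", "Fingal", "Dublin City", "Dublin City", "Dublin City"],
    "price": [100.0, 120.0, 90.0, 150.0, 300.0],
})

# Named aggregation: each keyword becomes a flat output column.
summary = sample.groupby("neighbourhood").agg(
    max_price=("price", "max"),
    min_price=("price", "min"),
    median_price=("price", "median"),
)
print(summary)
```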


##### The number of listings of each property type along with the max, min and median price for each

In [29]:
price_stats = listings.groupby('property_type')['price'].agg(['max', 'min', 'median'])

# Get the count of each property type
property_counts = listings['property_type'].value_counts()

# Combine the count data with the price statistics
combined_data = price_stats.join(property_counts.rename('count'))
combined_data = combined_data.sort_values('count', ascending=False)
combined_data

| property_type | max | min | median | count |
|---|---|---|---|---|
| Entire rental unit | 1334.0 | 46.0 | 167.0 | 1100 |
| Private room in home | 1700.0 | 22.0 | 81.0 | 1036 |
| Entire home | 2500.0 | 55.0 | 280.0 | 703 |
| Private room in rental unit | 1570.0 | 26.0 | 90.0 | 444 |
| Entire condo | 2500.0 | 61.0 | 179.5 | 382 |
| Private room in condo | 499.0 | 33.0 | 90.0 | 216 |
| Private room in townhouse | 450.0 | 40.0 | 90.5 | 116 |
| Entire townhouse | 950.0 | 90.0 | 300.0 | 93 |
| Entire serviced apartment | 519.0 | 97.0 | 226.0 | 68 |
| Private room in bed and breakfast | 265.0 | 46.0 | 99.0 | 67 |
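
The count and the price statistics can also be computed in a single `agg` call instead of a separate `value_counts` and `join`. This is a sketch on made-up rows (`sample` is hypothetical). It uses `'size'` rather than `'count'` because `'size'` counts every row, like `value_counts`, whereas `'count'` would skip rows with a missing price:

```python
import pandas as pd

# Made-up listings; one row has a missing price to show that 'size' still counts it.
sample = pd.DataFrame({
    "property_type": ["Entire home", "Entire home",
                      "Private room in home", "Private room in home", "Private room in home"],
    "price": [280.0, 300.0, 80.0, None, 90.0],
})

combined = (
    sample.groupby("property_type")["price"]
    .agg(["max", "min", "median", "size"])
    .rename(columns={"size": "count"})
    .sort_values("count", ascending=False)
)
print(combined)
```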


In [32]:
listings['property_type'].value_counts()

Entire rental unit                    1100
Private room in home                  1036
Entire home                            703
Private room in rental unit            444
Entire condo                           382
Private room in condo                  216
Private room in townhouse              116
Entire townhouse                        93
Entire serviced apartment               68
Private room in bed and breakfast       67
Shared room in home                     58
Private room in guesthouse              45
Entire guesthouse                       41
Entire guest suite                      38
Shared room in hostel                   35
Shared room in rental unit              34
Entire cottage                          29
Room in hotel                           25
Room in boutique hotel                  20
Entire bungalow                         19
Entire loft                             19
Entire cabin                            17
Private room in bungalow                17
Tiny home  

In [62]:
grouped_by_host = listings.groupby('host_name').agg({
    'id': 'count',  # Number of listings per host
    'price': 'mean',  # Average price of listings per host
    'review_scores_rating': 'mean'  # Average review score
}).rename(columns={'id': 'number_of_listings'})

grouped_by_host = grouped_by_host.sort_values('number_of_listings', ascending=False)
grouped_by_host = grouped_by_host.reset_index()
grouped_by_host.head(15)

| | host_name | number_of_listings | price | review_scores_rating |
|---|---|---|---|---|
| 0 | Paul | 113 | 165.893805 | 4.697273 |
| 1 | Daniel And G | 79 | 228.481013 | 4.490556 |
| 2 | Ian | 59 | 166.508475 | 4.743143 |
| 3 | David | 51 | 198.823529 | 4.817436 |
| 4 | Lucas | 45 | 110.688889 | 4.519091 |
| 5 | Daniel | 44 | 207.590909 | 4.679 |
| 6 | James | 43 | 222.744186 | 4.5925 |
| 7 | John | 42 | 185.857143 | 4.626471 |
| 8 | Mary | 38 | 144.815789 | 4.855588 |
| 9 | Mark | 36 | 223.194444 | 4.797879 |
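
Because `host_name` is not unique (several different hosts can share a first name such as Paul), grouping by name can merge unrelated accounts. A minimal sketch of the same aggregation keyed on `host_id`, which appears in the listings columns, using made-up rows (`sample` and `by_host` are hypothetical names):

```python
import pandas as pd

# Made-up listings: two different hosts are both called "Paul".
sample = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "host_id": [10, 10, 20, 30, 30],
    "host_name": ["Paul", "Paul", "Paul", "Mary", "Mary"],
    "price": [100.0, 200.0, 150.0, 80.0, 120.0],
})

by_host = (
    sample.groupby("host_id")
    .agg(
        host_name=("host_name", "first"),   # keep a readable label
        number_of_listings=("id", "count"),
        mean_price=("price", "mean"),
    )
    .sort_values("number_of_listings", ascending=False)
    .reset_index()
)
print(by_host)
```

In this sample, grouping by name would report one "Paul" with three listings, while grouping by `host_id` shows two separate hosts with two and one listings.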


In [59]:
grouped_by_host.index.isna().any()

False

In [64]:
top_10_hosts = grouped_by_host.head(10)
top_10_hosts['number_of_listings'].sum()
## The top 10 host names account for 550 listings

550
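
A raw total is easier to interpret as a share of all listings. A minimal sketch of that calculation on made-up counts (`listings_per_host` and `top_n` are hypothetical; the real share is not computed here):

```python
import pandas as pd

# Made-up number of listings per host.
listings_per_host = pd.Series({"host_a": 5, "host_b": 3, "host_c": 1, "host_d": 1})

top_n = 2
top_total = listings_per_host.nlargest(top_n).sum()
top_share = top_total / listings_per_host.sum()
print(f"Top {top_n} hosts: {top_total} listings, {top_share:.0%} of the total")
```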

In [65]:
top_10_hosts

| | host_name | number_of_listings | price | review_scores_rating |
|---|---|---|---|---|
| 0 | Paul | 113 | 165.893805 | 4.697273 |
| 1 | Daniel And G | 79 | 228.481013 | 4.490556 |
| 2 | Ian | 59 | 166.508475 | 4.743143 |
| 3 | David | 51 | 198.823529 | 4.817436 |
| 4 | Lucas | 45 | 110.688889 | 4.519091 |
| 5 | Daniel | 44 | 207.590909 | 4.679 |
| 6 | James | 43 | 222.744186 | 4.5925 |
| 7 | John | 42 | 185.857143 | 4.626471 |
| 8 | Mary | 38 | 144.815789 | 4.855588 |
| 9 | Mark | 36 | 223.194444 | 4.797879 |
