### Exploratory data analysis of Barcelona

In this section, we perform an exploratory data analysis (EDA) of the Barcelona dataset to understand the patterns and characteristics of the city's rental market. Because Barcelona receives a large influx of tourists, especially in its central districts, the analysis looks at the factors associated with rental prices and at how tourist accommodation is distributed across the city.

The EDA covers location, pricing, reviews, amenities, room types, and hosts to give an overview of the rental landscape in Barcelona.

Summary of the Content:

1. Basic Descriptive Statistics

2. Neighborhood Analysis

    2.1. Top Neighborhoods by Listing Count
    2.2. Average Price by Neighborhood

3. Price Distribution

4. Review Analysis

    4.1. Review Scores Distribution
    4.2. Average Review Scores by Neighborhood

5. Amenities Analysis

    5.1. Count of Listings with Specific Amenities

6. Room Type vs. Price

    6.1. Average Price by Room Type

7. Host Analysis

    7.1. Superhost vs. Price
    7.2. Average Price by Host Listings Count

8. Geographic Distribution of Listings



In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AirbnbListingsAnalysis").getOrCreate()



In [4]:
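# Path to the cleaned Barcelona listings (semicolon-separated CSV)
file_path = 'Datasets/Final_cleaned_dataset/bcn_final_cleaned_data_csv.csv'

# Load the cleaned data, letting Spark infer column types from the values
df_bcn = spark.read.csv(file_path, sep=";", header=True, inferSchema=True)

# Preview the first rows; uncomment printSchema() to verify the inferred data types
# df_bcn.printSchema()
df_bcn.show(5)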


+----------------------------+------+--------------------+--------------------+-------+--------------+------------------------------+-----------------+-----------------+-----------------+-----+-------+----------------+--------+----------------+---------+----------+------------------+-----------------+--------------------+-----------------+----------------+
|neighbourhood_group_cleansed|    id|         listing_url|                name|host_id|     host_name|calculated_host_listings_count|host_is_superhost|         latitude|        longitude|price|kitchen|patio or balcony|elevator|air conditioning|long_term|short_term|possible_long_term|number_of_reviews|review_scores_rating|room_type_encoded|bedrooms_encoded|
+----------------------------+------+--------------------+--------------------+-------+--------------+------------------------------+-----------------+-----------------+-----------------+-----+-------+----------------+--------+----------------+---------+----------+-----------------

## Summary of Columns:
    neighbourhood_group_cleansed: District (neighborhood group) where the listing is located.
    id: Unique identifier of the listing.
    listing_url: URL of the listing.
    name: Name of the listing.
    host_id: Unique identifier of the host.
    host_name: Name of the host.
    calculated_host_listings_count: Number of listings managed by the host.
    host_is_superhost: Whether the host is a superhost (a trusted, highly rated host); stored as t/f.
    latitude & longitude: Geographic coordinates of the listing.
    price: Price per night for the listing.
    kitchen: Flag (0/1) indicating whether the listing has a kitchen.
    patio or balcony, elevator, air conditioning: Flags (0/1) for further amenities available in the listing.
    long_term, short_term, possible_long_term: Flags indicating whether the listing is available for long-term or short-term stays.
    number_of_reviews: Number of reviews the listing has received.
    review_scores_rating: Average rating of the listing.
    room_type_encoded: Encoded value representing the type of room (e.g., entire home, private room).
    bedrooms_encoded: Encoded value indicating whether the listing has more than one bedroom.

## 1. Basic Descriptive Statistics:

In [5]:
# Get an overview of numeric columns such as price, number_of_reviews, and review_scores_rating.
df_bcn.describe(["price", "number_of_reviews", "review_scores_rating"]).show()


+-------+------------------+-----------------+--------------------+
|summary|             price|number_of_reviews|review_scores_rating|
+-------+------------------+-----------------+--------------------+
|  count|             18773|            18773|               18773|
|   mean|195.56352207958238|46.75763063974858|   3.438658179300036|
| stddev|294.99165058230136|96.78660650621215|  2.0380977270379357|
|    min|              10.0|                0|                 0.0|
|    max|           13714.0|             2121|                 5.0|
+-------+------------------+-----------------+--------------------+



## 2. Neighborhood Analysis:
### 2.1 Top Neighborhoods by Listing Count:

In [6]:
df_bcn.groupBy("neighbourhood_group_cleansed").count().orderBy("count", ascending=False).show()

+----------------------------+-----+
|neighbourhood_group_cleansed|count|
+----------------------------+-----+
|                    Eixample| 6692|
|                Ciutat Vella| 4390|
|              Sants-Montjuïc| 1969|
|                  Sant Martí| 1745|
|                      Gràcia| 1573|
|         Sarrià-Sant Gervasi|  988|
|              Horta-Guinardó|  557|
|                   Les Corts|  403|
|                 Sant Andreu|  316|
|                  Nou Barris|  224|
+----------------------------+-----+
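Raw counts are easier to compare as shares of the total. The sketch below is plain Python rather than PySpark, so it runs without a Spark session; the counts are copied from the output above, and the variable names are only illustrative.

```python
# Listing counts per district, copied from the groupBy output above.
counts = {
    "Eixample": 6692,
    "Ciutat Vella": 4390,
    "Sants-Montjuïc": 1969,
    "Sant Martí": 1745,
    "Gràcia": 1573,
    "Sarrià-Sant Gervasi": 988,
    "Horta-Guinardó": 557,
    "Les Corts": 403,
    "Sant Andreu": 316,
    "Nou Barris": 224,
}

total = sum(counts.values())

# Percentage of all listings that falls in each district.
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:<22}{pct:5.1f}%")

# Combined share of the two central districts.
top_two = shares["Eixample"] + shares["Ciutat Vella"]
print(f"Eixample + Ciutat Vella: {top_two:.1f}% of {total} listings")
```

On these figures the two central districts hold a little under 60% of all listings, which is consistent with the concentration of tourist accommodation in the city centre.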



### 2.2 Average Price by Neighborhood:

In [7]:
df_bcn.groupBy("neighbourhood_group_cleansed").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+----------------------------+------------------+
|neighbourhood_group_cleansed|        avg(price)|
+----------------------------+------------------+
|                    Eixample|238.89269098003902|
|                      Gràcia|194.56741214057507|
|                  Sant Martí|193.31805157593124|
|              Sants-Montjuïc| 189.4425140521206|
|                Ciutat Vella|166.53377604762994|
|         Sarrià-Sant Gervasi|164.52385786802031|
|                   Les Corts|156.82133995037222|
|              Horta-Guinardó| 120.1374321880651|
|                 Sant Andreu| 93.79617834394904|
|                  Nou Barris| 84.15837104072398|
+----------------------------+------------------+
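Group means like these are sensitive to extreme prices (the dataset's maximum is 13714.0), so it may be worth comparing them with medians. The following is a minimal stdlib sketch of that effect; the prices in `sample_prices` are hypothetical, apart from the maximum taken from the descriptive statistics.

```python
from statistics import mean, median

# Hypothetical nightly prices for one district; the last value mimics
# the dataset's maximum price and acts as an outlier.
sample_prices = [60, 80, 90, 110, 13714]

avg = mean(sample_prices)
mid = median(sample_prices)

# The single outlier pulls the mean far above every typical price,
# while the median stays at the middle observation.
print(f"mean = {avg}, median = {mid}")
```

In PySpark the same comparison could be made by aggregating an approximate median alongside the mean, but that step is not part of the original notebook.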



## 3. Price Distribution:

In [8]:
df_bcn.select("price").summary().show()

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|             18773|
|   mean|195.56352207958238|
| stddev|294.99165058230136|
|    min|              10.0|
|    25%|              82.0|
|    50%|             167.0|
|    75%|             239.0|
|    max|           13714.0|
+-------+------------------+
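The mean (about 195.6) sits above the median (167.0) and the maximum is far from the 75th percentile, which points to a right-skewed distribution. One common way to flag unusually high prices is Tukey's rule on the interquartile range. The sketch below applies it to the quartiles reported above; the helper name `tukey_fences` and the multiplier of 1.5 are conventional choices, not something the notebook specifies.

```python
def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles taken from the summary() output above.
lower, upper = tukey_fences(82.0, 239.0)

# A negative lower fence means no price can be flagged as unusually low.
print(f"lower fence = {lower}, upper fence = {upper}")
```

Under this rule, prices above the upper fence would be treated as potential outliers. Note that the percentiles Spark reports in `summary()` are approximate, so the fences are approximate as well.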



## 4. Review Analysis:
### 4.1 Review Scores Distribution:

In [9]:
df_bcn.groupBy("review_scores_rating").count().orderBy("review_scores_rating").show()

+--------------------+-----+
|review_scores_rating|count|
+--------------------+-----+
|                NULL|   84|
|                 0.0| 4723|
|                 1.0|   67|
|                1.25|    1|
|                 1.5|    2|
|                1.67|    1|
|                 2.0|   48|
|                2.17|    1|
|                2.25|    1|
|                2.33|    2|
|                2.45|    1|
|                 2.5|   19|
|                 2.6|    1|
|                2.63|    1|
|                2.67|   10|
|                2.71|    1|
|                2.75|    1|
|                 2.8|    1|
|                2.83|    1|
|                2.86|    1|
+--------------------+-----+
only showing top 20 rows
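The 4723 listings rated 0.0 weigh heavily on the overall mean of about 3.44. If a rating of 0.0 marks a listing that has no reviews rather than a genuine score (an assumption; the source does not say how missing ratings were encoded), the mean over rated listings can be recovered from the figures already shown:

```python
# Figures copied from the outputs above.
non_null = 18773                 # non-null ratings reported by describe()
mean_all = 3.438658179300036     # mean rating including the 0.0 values
zero_rated = 4723                # listings with a rating of exactly 0.0

# Zeros add nothing to the sum, so the total rating mass is unchanged
# when they are dropped; only the denominator shrinks.
rating_sum = mean_all * non_null
rated = non_null - zero_rated
mean_rated = rating_sum / rated

print(f"share rated 0.0: {zero_rated / non_null:.1%}")
print(f"mean over rated listings: {mean_rated:.2f}")
```

Under that assumption roughly a quarter of the listings are unrated, and the average among rated listings is close to 4.6 rather than 3.4.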



### 4.2 Average Review Scores by Neighborhood:

In [10]:
df_bcn.groupBy("neighbourhood_group_cleansed").agg({"review_scores_rating": "mean"}).orderBy("avg(review_scores_rating)", ascending=False).show()


+----------------------------+-------------------------+
|neighbourhood_group_cleansed|avg(review_scores_rating)|
+----------------------------+-------------------------+
|                    Eixample|       3.5628815848716635|
|                  Sant Martí|        3.558103151862467|
|              Sants-Montjuïc|       3.5348134900357664|
|                      Gràcia|       3.4701405750798706|
|                Ciutat Vella|        3.379420654911832|
|              Horta-Guinardó|        3.212784810126581|
|                   Les Corts|       3.1971712158808923|
|                 Sant Andreu|        3.186210191082802|
|                  Nou Barris|       3.0209954751131214|
|         Sarrià-Sant Gervasi|       2.8081116751269044|
+----------------------------+-------------------------+



## 5. Amenities Analysis:
### 5.1 Count of Listings with Specific Amenities:

In [11]:
df_bcn.groupBy("kitchen", "air conditioning", "elevator", "patio or balcony").count().show()

+-------+----------------+--------+----------------+-----+
|kitchen|air conditioning|elevator|patio or balcony|count|
+-------+----------------+--------+----------------+-----+
|      0|               0|       0|               1|   72|
|      1|               0|       1|               1| 1294|
|      1|               1|       0|               0| 2933|
|      1|               1|       1|               1| 3307|
|      0|               1|       0|               1|   78|
|      0|               0|       0|               0|  387|
|   NULL|            NULL|    NULL|            NULL|   84|
|      1|               0|       0|               0| 2520|
|      0|               1|       1|               0|  464|
|      0|               0|       1|               1|   57|
|      0|               1|       1|               1|  140|
|      0|               0|       1|               0|  244|
|      0|               1|       0|               0|  417|
|      1|               0|       0|               1| 100
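The query above counts combinations of amenities, not listings per amenity. A small helper can collapse such combination counts into per-amenity totals. This is a plain-Python sketch; `per_amenity_totals` is a hypothetical helper, and `sample` contains only four rows from the output above, so the printed totals are not the full dataset figures.

```python
from collections import Counter

# Column order used in the groupBy above.
AMENITIES = ("kitchen", "air conditioning", "elevator", "patio or balcony")


def per_amenity_totals(combo_counts):
    """Sum listing counts per amenity from {(flag, flag, flag, flag): count}."""
    totals = Counter()
    for flags, n in combo_counts.items():
        for name, flag in zip(AMENITIES, flags):
            if flag:
                totals[name] += n
    return totals


# Four rows copied from the output above (a partial table, for illustration).
sample = {
    (1, 1, 1, 1): 3307,
    (1, 1, 0, 0): 2933,
    (1, 0, 0, 0): 2520,
    (0, 0, 0, 0): 387,
}

totals = per_amenity_totals(sample)
for name in AMENITIES:
    print(f"{name:<18}{totals[name]}")
```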

## 6. Room Type vs. Price:
### 6.1 Average Price by Room Type:

In [12]:
df_bcn.groupBy("room_type_encoded").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+-----------------+------------------+
|room_type_encoded|        avg(price)|
+-----------------+------------------+
|                2|239.46739617902952|
|                1|131.63665551839466|
|                0|117.50335570469798|
|             NULL|              NULL|
+-----------------+------------------+
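The NULL group above, like the 84 NULL rows in the review-score and amenity counts, may come from rows that were not split into the expected number of columns when the CSV was read, for example because a listing name contains an unquoted `;` or a line break. That cause is a guess, not something the source confirms. One way to check is to count the fields per row with Python's `csv` module; the sample data and the helper name `rows_with_wrong_width` below are hypothetical.

```python
import csv
import io


def rows_with_wrong_width(lines, delimiter=";"):
    """Return (line_number, field_count) for rows that do not match the header width."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)
    bad = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the physical line number of the row just read.
            bad.append((reader.line_num, len(row)))
    return bad


# Hypothetical sample: one clean row, one with a stray delimiter, one truncated.
sample = io.StringIO(
    "id;name;price\n"
    "1;Sunny flat;120\n"
    "2;Loft; near beach;95\n"
    "3;Studio\n"
)

bad_rows = rows_with_wrong_width(sample)
print(bad_rows)
```

Running the same helper over the real file (opened with `newline=""`) would show whether the NULL rows correspond to malformed lines.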



## 7. Host Analysis:
### 7.1 Superhost vs. Price:

In [13]:
df_bcn.groupBy("host_is_superhost").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+-----------------+------------------+
|host_is_superhost|        avg(price)|
+-----------------+------------------+
|                t|209.39581657280772|
|                f|192.13487104493487|
|             NULL|              NULL|
+-----------------+------------------+



### 7.2 Average Price by Host Listings Count:

In [14]:
df_bcn.groupBy("calculated_host_listings_count").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+------------------------------+------------------+
|calculated_host_listings_count|        avg(price)|
+------------------------------+------------------+
|                           207| 808.9558011049724|
|                            23| 449.3623188405797|
|                            36| 379.6111111111111|
|                           195| 366.9846153846154|
|                            43| 332.3953488372093|
|                            30|328.26666666666665|
|                           140|            325.65|
|                            52| 302.9807692307692|
|                            80|          296.1625|
|                            34|292.19117647058823|
|                            35| 292.1771428571429|
|                            66|        290.515625|
|                            76|287.88157894736844|
|                            37| 274.7090909090909|
|                            44|274.40909090909093|
|                            50|            273.68|
|           
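Grouping by the exact listings count produces many small groups, each often dominated by a single host, so the ranking above is hard to interpret. Bucketing hosts into bands may give a steadier picture. The sketch below uses plain Python with hypothetical band edges and sample rows; neither comes from the notebook.

```python
from collections import defaultdict
from statistics import mean


def listing_band(n):
    """Map a host's listings count to a coarse band (band edges are illustrative)."""
    if n == 1:
        return "1"
    if n <= 5:
        return "2-5"
    if n <= 20:
        return "6-20"
    return "21+"


def mean_price_by_band(rows):
    """Average price per band from (host_listings_count, price) pairs."""
    groups = defaultdict(list)
    for n, price in rows:
        groups[listing_band(n)].append(price)
    return {band: mean(prices) for band, prices in groups.items()}


# Hypothetical (calculated_host_listings_count, price) pairs.
rows = [(1, 80.0), (1, 100.0), (3, 150.0), (12, 200.0), (23, 450.0), (207, 800.0)]

by_band = mean_price_by_band(rows)
for band in ("1", "2-5", "6-20", "21+"):
    print(f"{band:<5}{by_band[band]:.2f}")
```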

## 8. Geographic Distribution of Listings:

In [16]:
# Collect latitude and longitude data for visualization
geo_data = df_bcn.select("latitude", "longitude").toPandas()

# Save to CSV or another format for use in visualization tools
geo_data.to_csv("geographic_distribution.csv", index=False)
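Before moving to a visualization tool, the exported coordinates can be summarised as a coarse density grid. The sketch below rounds each point to two decimal places, about 1 km in latitude, and counts listings per cell. The coordinates are hypothetical points in central Barcelona, and `grid_counts` is an illustrative helper rather than part of the notebook.

```python
from collections import Counter


def grid_counts(points, decimals=2):
    """Count points per grid cell, where a cell is the rounded (lat, lon) pair."""
    return Counter(
        (round(lat, decimals), round(lon, decimals)) for lat, lon in points
    )


# Hypothetical (latitude, longitude) pairs; real values would come from
# the rows of geographic_distribution.csv.
points = [(41.3874, 2.1686), (41.3879, 2.1699), (41.4036, 2.1744)]

cells = grid_counts(points)
for cell, n in cells.most_common():
    print(cell, n)
```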

In [17]:
spark.stop()