### Exploratory data analysis of Madrid

In this section, we perform an exploratory data analysis (EDA) of the Madrid Airbnb listings dataset to understand the patterns and characteristics of the city's rental market. Given Madrid's significant influx of tourists, especially in central areas, the analysis focuses on the factors associated with listing prices and on how tourist accommodations are distributed across the city.

The EDA covers location, pricing, reviews, amenities, room types, and hosts to give an overview of the rental landscape in Madrid.

Contents:

1. Basic Descriptive Statistics

2. Neighborhood Analysis

    2.1 Top Neighborhoods by Listing Count
    2.2 Average Price by Neighborhood

3. Price Distribution

4. Review Analysis

    4.1 Review Scores Distribution
    4.2 Average Review Scores by Neighborhood

5. Amenities Analysis

    5.1 Count of Listings with Specific Amenities

6. Room Type

    6.1 Distribution of Listings by Room Type
    6.2 Average Price by Room Type
    6.3 Average Price by Room Type and Number of Bedrooms

7. Host Analysis

    7.1 Superhost vs. Price
    7.2 Host Listings Count

8. Geographic Distribution of Listings

In [9]:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Airbnb EDA") \
    .getOrCreate()

In [10]:
# Load the cleaned data from the directory
file_path = "Datasets/Final_Cleaned_Dataset/mad_final_cleaned_date.csv"

df_mad = spark.read.csv(file_path, header=True, inferSchema=True, sep=';')

df_mad.createOrReplaceTempView("airbnb_listings")

df_mad.show(5)

+----------------------------+-------------------+--------------------+--------------------+---------+---------------+------------------------------+-----------------+-----------------+-------------------+-----+-------+----------------+--------+----------------+---------+----------+------------------+-----------------+--------------------+-----------------+----------------+
|neighbourhood_group_cleansed|                 id|         listing_url|                name|  host_id|      host_name|calculated_host_listings_count|host_is_superhost|         latitude|          longitude|price|kitchen|patio or balcony|elevator|air conditioning|long_term|short_term|possible_long_term|number_of_reviews|review_scores_rating|room_type_encoded|bedrooms_encoded|
+----------------------------+-------------------+--------------------+--------------------+---------+---------------+------------------------------+-----------------+-----------------+-------------------+-----+-------+----------------+--------+-

In [11]:
# Display the schema to verify data types
df_mad.printSchema()

root
 |-- neighbourhood_group_cleansed: string (nullable = true)
 |-- id: long (nullable = true)
 |-- listing_url: string (nullable = true)
 |-- name: string (nullable = true)
 |-- host_id: string (nullable = true)
 |-- host_name: string (nullable = true)
 |-- calculated_host_listings_count: string (nullable = true)
 |-- host_is_superhost: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- price: double (nullable = true)
 |-- kitchen: integer (nullable = true)
 |-- patio or balcony: integer (nullable = true)
 |-- elevator: integer (nullable = true)
 |-- air conditioning: integer (nullable = true)
 |-- long_term: integer (nullable = true)
 |-- short_term: integer (nullable = true)
 |-- possible_long_term: integer (nullable = true)
 |-- number_of_reviews: double (nullable = true)
 |-- review_scores_rating: double (nullable = true)
 |-- room_type_encoded: integer (nullable = true)
 |-- bedrooms_encoded: integer (nullable = true)
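
The inferred schema reports `host_id` and `calculated_host_listings_count` as strings, which usually means at least one row holds a non-numeric value in those columns. One possible way to make the types explicit is to pass a DDL schema string to the reader instead of relying on `inferSchema=True`. The sketch below is an assumption rather than the notebook's method: the column names come from the schema above, but the numeric types chosen for `host_id` and `calculated_host_listings_count` and the helper name `load_with_schema` are hypothetical.

```python
# Column names are taken from the printed schema. The types for host_id and
# calculated_host_listings_count are assumptions (the reader inferred strings).
COLUMNS = [
    ("neighbourhood_group_cleansed", "STRING"),
    ("id", "BIGINT"),
    ("listing_url", "STRING"),
    ("name", "STRING"),
    ("host_id", "BIGINT"),
    ("host_name", "STRING"),
    ("calculated_host_listings_count", "INT"),
    ("host_is_superhost", "STRING"),
    ("latitude", "DOUBLE"),
    ("longitude", "DOUBLE"),
    ("price", "DOUBLE"),
    ("kitchen", "INT"),
    ("patio or balcony", "INT"),
    ("elevator", "INT"),
    ("air conditioning", "INT"),
    ("long_term", "INT"),
    ("short_term", "INT"),
    ("possible_long_term", "INT"),
    ("number_of_reviews", "DOUBLE"),
    ("review_scores_rating", "DOUBLE"),
    ("room_type_encoded", "INT"),
    ("bedrooms_encoded", "INT"),
]

# Backticks keep names that contain spaces (e.g. "patio or balcony") valid in DDL.
SCHEMA_DDL = ", ".join(f"`{name}` {dtype}" for name, dtype in COLUMNS)


def load_with_schema(spark, path):
    """Read the CSV with the explicit schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=SCHEMA_DDL, sep=";")
```

With Spark's default permissive parsing, values that cannot be read as the declared type should come back as null, so malformed rows would show up as nulls instead of turning a whole column into strings.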


In [12]:
df_mad.describe().show()

24/08/23 22:17:05 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.

+-------+----------------------------+--------------------+--------------------+--------------------+--------------------+--------------+------------------------------+-----------------+-------------------+-------------------+-----------------+-------------------+-------------------+-------------------+------------------+--------------------+------------------+-------------------+------------------+--------------------+------------------+-------------------+
|summary|neighbourhood_group_cleansed|                  id|         listing_url|                name|             host_id|     host_name|calculated_host_listings_count|host_is_superhost|           latitude|          longitude|            price|            kitchen|   patio or balcony|           elevator|  air conditioning|           long_term|        short_term| possible_long_term| number_of_reviews|review_scores_rating| room_type_encoded|   bedrooms_encoded|
+-------+----------------------------+--------------------+---------------

                                                                                

## Summary of Columns:
    neighbourhood_group_cleansed: The neighborhood group (district) where the listing is located.
    id: Unique identifier for the listing.
    listing_url: URL of the listing.
    name: Name of the listing.
    host_id: Unique identifier for the host.
    host_name: Name of the host.
    calculated_host_listings_count: Number of listings managed by the host.
    host_is_superhost: Whether the host is a superhost (t/f), i.e., a trusted, highly rated host.
    latitude & longitude: Geographic coordinates of the listing.
    price: Price per night for the listing.
    kitchen, patio or balcony, elevator, air conditioning: Flags (1/0) indicating whether the listing offers each amenity.
    long_term, short_term, possible_long_term: Flags indicating whether the listing is available for long-term or short-term stays.
    number_of_reviews: Number of reviews the listing has received.
    review_scores_rating: Average rating of the listing (0 to 5 in this dataset).
    room_type_encoded: Encoded value representing the type of room (e.g., entire home, private room).
    bedrooms_encoded: Encoded value indicating if the listing has more than one bedroom.

In [13]:
df_mad.columns

['neighbourhood_group_cleansed',
 'id',
 'listing_url',
 'name',
 'host_id',
 'host_name',
 'calculated_host_listings_count',
 'host_is_superhost',
 'latitude',
 'longitude',
 'price',
 'kitchen',
 'patio or balcony',
 'elevator',
 'air conditioning',
 'long_term',
 'short_term',
 'possible_long_term',
 'number_of_reviews',
 'review_scores_rating',
 'room_type_encoded',
 'bedrooms_encoded']

## 1. Basic Descriptive Statistics:

In [14]:
# Get an overview of numeric columns such as price, number_of_reviews, and review_scores_rating
df_mad.describe(["price", "number_of_reviews", "review_scores_rating"]).show()

+-------+-----------------+------------------+--------------------+
|summary|            price| number_of_reviews|review_scores_rating|
+-------+-----------------+------------------+--------------------+
|  count|            26765|             26765|               26765|
|   mean|136.8807771343172|44.096166635531475|   3.650783485895769|
| stddev|269.0122879453736| 83.49385005383789|  1.9611256892786988|
|    min|              1.0|               0.0|                 0.0|
|    max|          21000.0|            1060.0|                 5.0|
+-------+-----------------+------------------+--------------------+



## 2. Neighborhood Analysis:
### 2.1 Top Neighborhoods by Listing Count:

In [15]:
df_mad.groupBy("neighbourhood_group_cleansed").count().orderBy("count", ascending=False).show()

+----------------------------+-----+
|neighbourhood_group_cleansed|count|
+----------------------------+-----+
|                      Centro|11282|
|                    Chamberí| 1800|
|                   Salamanca| 1733|
|                      Tetuán| 1605|
|                  Arganzuela| 1404|
|                 Carabanchel|  988|
|                      Retiro|  941|
|               Ciudad Lineal|  925|
|                   Chamartín|  868|
|          Puente de Vallecas|  823|
|                      Latina|  773|
|           Moncloa - Aravaca|  727|
|                       Usera|  610|
|        San Blas - Canill...|  584|
|                   Hortaleza|  531|
|        Fuencarral - El P...|  412|
|                  Villaverde|  260|
|                   Moratalaz|  195|
|                     Barajas|  195|
|           Villa de Vallecas|  134|
+----------------------------+-----+
only showing top 20 rows



### 2.2 Average Price by Neighborhood:

In [16]:
df_mad.groupBy("neighbourhood_group_cleansed").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+----------------------------+------------------+
|neighbourhood_group_cleansed|        avg(price)|
+----------------------------+------------------+
|                   Salamanca|176.63678426836321|
|                      Centro|157.34260001782056|
|                      Tetuán|145.87272727272727|
|                   Hortaleza|143.12428298279158|
|                  Arganzuela|141.11903064861013|
|                   Chamartín| 129.1600928074246|
|                    Chamberí|127.65850945494995|
|        San Blas - Canill...|125.90051457975986|
|                      Retiro|122.35501066098081|
|           Moncloa - Aravaca|121.35399449035813|
|          Puente de Vallecas|112.09268292682927|
|                     Barajas|100.93333333333334|
|               Ciudad Lineal| 98.63636363636364|
|        Fuencarral - El P...| 95.38199513381996|
|                 Carabanchel| 82.37348178137651|
|                       Usera| 73.89638157894737|
|           Villa de Vallecas| 72.67164179104478|


## 3. Price Distribution:

In [17]:
df_mad.select("price").summary().show()

+-------+-----------------+
|summary|            price|
+-------+-----------------+
|  count|            26765|
|   mean|136.8807771343172|
| stddev|269.0122879453736|
|    min|              1.0|
|    25%|             70.0|
|    50%|            112.0|
|    75%|            157.0|
|    max|          21000.0|
+-------+-----------------+
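
The quartiles above (70, 112, 157) sit far below the maximum of 21,000, so a handful of extreme prices dominate the mean and standard deviation. A common rule of thumb, which the notebook itself does not apply, flags values more than 1.5 interquartile ranges beyond the quartiles. The following is a minimal sketch of that rule; `iqr_bounds` and `filter_price_outliers` are hypothetical helper names, and the 1.5 multiplier and the 0.01 relative error are assumptions.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) fences at k interquartile ranges from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles copied from the summary() output above.
lower, upper = iqr_bounds(70.0, 157.0)
print(lower, upper)  # -60.5 287.5


def filter_price_outliers(df, k=1.5):
    """Keep rows whose price lies inside the IQR fences (hypothetical helper)."""
    # approxQuantile(column, probabilities, relativeError)
    q1, q3 = df.approxQuantile("price", [0.25, 0.75], 0.01)
    low, high = iqr_bounds(q1, q3, k)
    # The lower fence is negative here, so it is clipped at zero.
    return df.filter(df["price"].between(max(low, 0.0), high))
```

Under these assumptions, `filter_price_outliers(df_mad).select("price").summary().show()` would give a price summary that is less affected by the extreme listings.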



## 4. Review Analysis:
### 4.1 Review Scores Distribution:

In [18]:
df_mad.groupBy("review_scores_rating").count().orderBy("review_scores_rating").show()

+--------------------+-----+
|review_scores_rating|count|
+--------------------+-----+
|                NULL|  108|
|                 0.0| 5816|
|                 1.0|   71|
|                 1.5|    2|
|                1.67|    1|
|                1.75|    1|
|                 2.0|   62|
|                2.33|    2|
|                 2.4|    1|
|                 2.5|   24|
|                 2.6|    1|
|                2.67|    6|
|                2.86|    1|
|                 3.0|  142|
|                3.13|    1|
|                3.14|    1|
|                3.17|    3|
|                 3.2|    4|
|                3.22|    1|
|                3.25|   11|
+--------------------+-----+
only showing top 20 rows
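
The table shows 5,816 listings with a rating of exactly 0.0. If those zeros are placeholders for listings that have no rating (an assumption; the cleaning step is not shown in this notebook), they pull the overall mean of 3.65 down considerably. The arithmetic below reuses the figures printed above to estimate the mean over rated listings only, and `mean_rating_rated_only` is a hypothetical helper that would compute the same quantity directly in Spark.

```python
# Figures copied from the describe() and groupBy() outputs above.
non_null_count = 26765                 # rows with a non-null review_scores_rating
mean_with_zeros = 3.650783485895769    # mean reported by describe()
zero_count = 5816                      # rows with review_scores_rating == 0.0

# Zeros add nothing to the sum of ratings, so only the denominator changes.
rating_sum = mean_with_zeros * non_null_count
mean_rated_only = rating_sum / (non_null_count - zero_count)
print(round(mean_rated_only, 2))  # 4.66


def mean_rating_rated_only(df):
    """Average rating over listings with a rating above zero (hypothetical helper)."""
    return df.filter(df["review_scores_rating"] > 0).agg({"review_scores_rating": "mean"})
```

The same caveat would apply to any per-neighborhood average of `review_scores_rating`, since districts with many unrated listings would appear to score lower.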



### 4.2 Average Review Scores by Neighborhood:

In [19]:
df_mad.groupBy("neighbourhood_group_cleansed").agg({"review_scores_rating": "mean"}).orderBy("avg(review_scores_rating)", ascending=False).show()

+----------------------------+-------------------------+
|neighbourhood_group_cleansed|avg(review_scores_rating)|
+----------------------------+-------------------------+
|                      Centro|        3.892414684130801|
|                   Hortaleza|       3.7833078393881463|
|                  Arganzuela|        3.743799002138276|
|          Puente de Vallecas|       3.6840975609756104|
|                      Retiro|        3.653518123667379|
|                     Barajas|       3.6524615384615378|
|                      Latina|        3.593059895833338|
|                 Carabanchel|        3.591072874493933|
|               Ciudad Lineal|       3.5661363636363643|
|                      Tetuán|        3.534714733542326|
|                   Salamanca|       3.4501098901098883|
|                       Usera|         3.43199013157895|
|        Fuencarral - El P...|        3.429878345498783|
|           Moncloa - Aravaca|        3.400303030303032|
|                   Chamartín| 

In [20]:
# Pearson correlation between price and number of reviews
# (a value close to 0 indicates no linear relationship)

from pyspark.sql.functions import corr

df_mad.select(corr("price", "number_of_reviews").alias("correlation")).show()

+--------------------+
|         correlation|
+--------------------+
|-0.01622024258685433|
+--------------------+
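
A Pearson coefficient near zero only rules out a linear relationship, and it is sensitive to extreme values such as the 21,000 maximum price. The toy example below uses hypothetical numbers (not taken from the dataset) to illustrate how a single extreme point can flip the sign of the coefficient. `corr_below_cap` is a hypothetical helper that would recompute the correlation after capping prices; the cap itself is an assumption left to the reader.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical toy data: a perfectly linear relationship...
prices = [50, 80, 110, 140, 170]
reviews = [10, 20, 30, 40, 50]
r_clean = pearson(prices, reviews)

# ...and the same data with one extreme, unreviewed listing appended.
r_outlier = pearson(prices + [21000], reviews + [0])


def corr_below_cap(df, cap):
    """Pearson correlation restricted to listings priced at or below cap."""
    return df.filter(df["price"] <= cap).stat.corr("price", "number_of_reviews")
```

In the toy data the coefficient drops from about 1 to a negative value once the extreme point is added, which is why the near-zero result above is best read together with the price distribution.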



## 5. Amenities Analysis:
### 5.1 Count of Listings with Specific Amenities:

In [21]:
df_mad.groupBy("kitchen", "air conditioning", "elevator", "patio or balcony").count().show()

+-------+----------------+--------+----------------+-----+
|kitchen|air conditioning|elevator|patio or balcony|count|
+-------+----------------+--------+----------------+-----+
|      0|               0|       0|               1|   95|
|      1|               0|       1|               1| 1094|
|      1|               1|       0|               0| 6111|
|      1|               1|       1|               1| 2091|
|      0|               1|       0|               1|   60|
|      0|               0|       0|               0|  703|
|   NULL|            NULL|    NULL|            NULL|  108|
|      1|               0|       0|               0| 5069|
|      0|               1|       1|               0|  496|
|      0|               0|       1|               1|   94|
|      0|               1|       1|               1|   54|
|      0|               0|       1|               0|  328|
|      0|               1|       0|               0|  669|
|      1|               0|       0|               1|  99

## 6. Room Type:

### Distribution of Listings by Room Type:

In [22]:
df_mad.groupBy("room_type_encoded") \
    .count() \
    .withColumnRenamed("count", "listing_count") \
    .orderBy("listing_count", ascending=False) \
    .show()

+-----------------+-------------+
|room_type_encoded|listing_count|
+-----------------+-------------+
|                2|        17233|
|                1|         9151|
|                0|          381|
|             NULL|          108|
+-----------------+-------------+
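
The 108 listings with a NULL room type match the 108 NULL rows in the review-score and amenity tables earlier, and the non-null counts add up to the 26,765 rows reported by `describe()`. The arithmetic below checks this and converts the counts into shares. The counts are copied from the output above; `drop_rows_without_room_type` is a hypothetical helper, sketched under the assumption that these rows are empty records that can be excluded.

```python
# Counts copied from the room_type_encoded table above (None stands for NULL).
room_type_counts = {2: 17233, 1: 9151, 0: 381, None: 108}

total_rows = sum(room_type_counts.values())
non_null_rows = total_rows - room_type_counts[None]
print(total_rows, non_null_rows)  # 26873 26765

# Share of each encoded room type among rows that have one, in percent.
shares = {
    code: round(100 * count / non_null_rows, 1)
    for code, count in room_type_counts.items()
    if code is not None
}
print(shares)  # {2: 64.4, 1: 34.2, 0: 1.4}


def drop_rows_without_room_type(df):
    """Keep only rows with a non-null room_type_encoded (hypothetical helper)."""
    return df.na.drop(subset=["room_type_encoded"])
```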



### Average Price by Room Type:

In [23]:
df_mad.groupBy("room_type_encoded").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+-----------------+-----------------+
|room_type_encoded|       avg(price)|
+-----------------+-----------------+
|                2|162.8450066732432|
|                0|104.8241469816273|
|                1| 89.3200743088187|
|             NULL|             NULL|
+-----------------+-----------------+



### Average Price by Room Type and Number of Bedrooms:

In [24]:
df_mad.groupBy("room_type_encoded", "bedrooms_encoded") \
    .agg({"price": "avg"}) \
    .withColumnRenamed("avg(price)", "avg_price") \
    .orderBy("room_type_encoded", "bedrooms_encoded") \
    .show()

+-----------------+----------------+-----------------+
|room_type_encoded|bedrooms_encoded|        avg_price|
+-----------------+----------------+-----------------+
|             NULL|            NULL|             NULL|
|                0|            NULL|              1.0|
|                0|               0|          102.608|
|                0|               1|            291.8|
|                1|               0|87.04572775486152|
|                1|               1|118.2957957957958|
|                2|               0|131.0345035710693|
|                2|               1|206.2114646187603|
+-----------------+----------------+-----------------+



## 7. Host Analysis:
### 7.1 Superhost vs. Price:

In [25]:
df_mad.groupBy("host_is_superhost").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+-----------------+-----------------+
|host_is_superhost|       avg(price)|
+-----------------+-----------------+
|                f|137.4596291572648|
|                t|135.0518971464409|
|         40.43926|              1.0|
|             NULL|             NULL|
+-----------------+-----------------+
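
The `40.43926` group is not a valid superhost flag. It resembles a latitude value, which suggests that at least one row has shifted columns, and its average price of 1.0 may be the same record behind the 1.0 minimum price seen earlier. One way to isolate such rows is sketched below; `VALID_SUPERHOST_FLAGS` and `split_by_superhost_validity` are hypothetical names, and treating only `t` and `f` as valid is an assumption based on the output above.

```python
# Flag values treated as valid, based on the groupBy output above (assumption).
VALID_SUPERHOST_FLAGS = ("t", "f")


def split_by_superhost_validity(df):
    """Return (valid, invalid) DataFrames based on host_is_superhost."""
    flag = df["host_is_superhost"]
    is_valid = flag.isin(*VALID_SUPERHOST_FLAGS)
    valid = df.filter(is_valid)
    # isin() evaluates to NULL for NULL flags, so those are matched explicitly.
    invalid = df.filter(~is_valid | flag.isNull())
    return valid, invalid
```

Calling `valid, invalid = split_by_superhost_validity(df_mad)` and then `invalid.show(truncate=False)` would make it possible to inspect the suspicious rows before deciding whether to drop them.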



### 7.2 Host Listings Count:

In [26]:
df_mad.groupBy("calculated_host_listings_count").agg({"price": "mean"}).orderBy("avg(price)", ascending=False).show()

+------------------------------+------------------+
|calculated_host_listings_count|        avg(price)|
+------------------------------+------------------+
|                            75|350.49333333333334|
|                            10|288.82440476190476|
|                            43|284.48837209302326|
|                           115|  281.302752293578|
|                            50|            267.08|
|                            89| 227.1123595505618|
|                            63|226.36507936507937|
|                            36|215.01388888888889|
|                            34|213.51470588235293|
|                            59|203.38983050847457|
|                           289|203.17301038062283|
|                            47| 183.7659574468085|
|                            30|178.23333333333332|
|                            28|174.74698795180723|
|                            16|             174.5|
|                            23|             171.0|
|           

## Geographic Distribution of Listings:

In [27]:
# Collect latitude and longitude data for visualization
geo_data = df_mad.select("latitude", "longitude").toPandas()

# Save to CSV or another format for use in visualization tools
geo_data.to_csv("geographic_distribution.csv", index=False)

In [28]:
# Stop the Spark session
spark.stop()