# Get the data

In this notebook we will focus on **Listings data**  

**Listings data** is a dataset that includes information about properties listed on Airbnb. This dataset includes details about the
* location,
* price,
* availability, and
* amenities of each property.

Here are some examples of the types of information that might be included in the listings data:

* **Property type**: Is the property a private room, shared room, or entire home/apartment?

* **Location**: Where is the property located (e.g., city, state, country)?

* **Price**: How much does it cost to stay at the property?

* **Availability**: When is the property available for booking?

* **Amenities**: What amenities are available at the property (e.g., WiFi, laundry facilities, kitchen)?

* **Photos**: What does the property look like (e.g., interior, exterior)?

Listings data can be useful for understanding the types of properties that are available on Airbnb, as well as the prices and amenities offered. It can also be used to analyze trends in the hospitality industry and understand the demand for different types of accommodations. For example, researchers might use listings data to study the impact of Airbnb on the housing market in a particular city, or to **understand the factors that influence the popularity of different types of properties**.


In [4]:
# Install wget
!pip install wget

# Remove airbnb_data folder (if exists)
!rm -rf airbnb_data



In [6]:
# Define list of data links to download
urls_clean = ['https://data.insideairbnb.com/united-states/tx/austin/2024-09-13/data/listings.csv.gz',
 'https://data.insideairbnb.com/thailand/central-thailand/bangkok/2024-09-25/data/listings.csv.gz',
 'https://data.insideairbnb.com/spain/catalonia/barcelona/2024-09-06/data/listings.csv.gz',
 'https://data.insideairbnb.com/united-states/fl/broward-county/2024-09-23/data/listings.csv.gz',
 'https://data.insideairbnb.com/argentina/ciudad-autónoma-de-buenos-aires/buenos-aires/2024-11-29/data/listings.csv.gz',
 'https://data.insideairbnb.com/south-africa/wc/cape-town/2024-09-25/data/listings.csv.gz',
 'https://data.insideairbnb.com/united-states/nv/clark-county-nv/2024-09-18/data/listings.csv.gz',
 'https://data.insideairbnb.com/denmark/hovedstaden/copenhagen/2024-09-28/data/listings.csv.gz',
 'https://data.insideairbnb.com/greece/crete/crete/2024-09-26/data/listings.csv.gz',
 'https://data.insideairbnb.com/spain/catalonia/girona/2024-09-29/data/listings.csv.gz',
 'https://data.insideairbnb.com/united-states/hi/hawaii/2024-09-13/data/listings.csv.gz',
 'https://data.insideairbnb.com/turkey/marmara/istanbul/2024-09-28/data/listings.csv.gz',
 'https://data.insideairbnb.com/portugal/lisbon/lisbon/2024-09-14/data/listings.csv.gz',
 'https://data.insideairbnb.com/united-kingdom/england/london/2024-09-06/data/listings.csv.gz',
 'https://data.insideairbnb.com/united-states/ca/los-angeles/2024-09-04/data/listings.csv.gz',
 'https://data.insideairbnb.com/spain/comunidad-de-madrid/madrid/2024-09-11/data/listings.csv.gz',
 'https://data.insideairbnb.com/spain/islas-baleares/mallorca/2024-09-13/data/listings.csv.gz',
 'https://data.insideairbnb.com/australia/vic/melbourne/2024-09-05/data/listings.csv.gz',
 'https://data.insideairbnb.com/mexico/df/mexico-city/2024-09-25/data/listings.csv.gz',
 'https://data.insideairbnb.com/italy/lombardy/milan/2024-09-17/data/listings.csv.gz',
 'https://data.insideairbnb.com/united-states/ny/new-york-city/2024-11-04/data/listings.csv.gz',
 'https://data.insideairbnb.com/france/ile-de-france/paris/2024-09-06/data/listings.csv.gz',
 'https://data.insideairbnb.com/italy/puglia/puglia/2024-09-29/data/listings.csv.gz',
 'https://data.insideairbnb.com/brazil/rj/rio-de-janeiro/2024-09-25/data/listings.csv.gz',
 'https://data.insideairbnb.com/italy/lazio/rome/2024-09-11/data/listings.csv.gz',
 'https://data.insideairbnb.com/italy/sicilia/sicily/2024-09-28/data/listings.csv.gz',
 'https://data.insideairbnb.com/greece/south-aegean/south-aegean/2024-09-20/data/listings.csv.gz',
 'https://data.insideairbnb.com/australia/nsw/sydney/2024-09-05/data/listings.csv.gz',
 'https://data.insideairbnb.com/japan/kantō/tokyo/2024-09-27/data/listings.csv.gz',
 'https://data.insideairbnb.com/canada/on/toronto/2024-11-07/data/listings.csv.gz']

# Import libraries
import wget
import os

for url in urls_clean:
  url_split = url.split("/")
  if url_split[5] == 'data':
    print(f'{url}\n - skipped')
    continue
  city = url_split[3] if url_split[5] == 'data' else url_split[5]
  out_dir = f"airbnb_data/{city}"
  os.makedirs(out_dir, exist_ok=False)
  print(f"{url}\n - downloaded to: {out_dir}/{url_split[-1]}")
  filename = wget.download(url, out=f"{out_dir}/{url_split[-1]}")

https://data.insideairbnb.com/united-states/tx/austin/2024-09-13/data/listings.csv.gz
 - downloaded to: airbnb_data/austin/listings.csv.gz
100% [......................................................] 8902975 / 8902975https://data.insideairbnb.com/thailand/central-thailand/bangkok/2024-09-25/data/listings.csv.gz
 - downloaded to: airbnb_data/bangkok/listings.csv.gz
100% [....................................................] 12630227 / 12630227https://data.insideairbnb.com/spain/catalonia/barcelona/2024-09-06/data/listings.csv.gz
 - downloaded to: airbnb_data/barcelona/listings.csv.gz
100% [......................................................] 9977937 / 9977937https://data.insideairbnb.com/united-states/fl/broward-county/2024-09-23/data/listings.csv.gz
 - downloaded to: airbnb_data/broward-county/listings.csv.gz
100% [......................................................] 9989897 / 9989897https://data.insideairbnb.com/argentina/ciudad-autónoma-de-buenos-aires/buenos-aires/2024-11-29/

In [8]:
# Print list of folders and files in current dir
!ls airbnb_data

[34maustin[m[m          [34mclark-county-nv[m[m [34mlisbon[m[m          [34mmexico-city[m[m     [34mrome[m[m
[34mbangkok[m[m         [34mcopenhagen[m[m      [34mlondon[m[m          [34mmilan[m[m           [34msicily[m[m
[34mbarcelona[m[m       [34mcrete[m[m           [34mlos-angeles[m[m     [34mnew-york-city[m[m   [34msouth-aegean[m[m
[34mbroward-county[m[m  [34mgirona[m[m          [34mmadrid[m[m          [34mparis[m[m           [34msydney[m[m
[34mbuenos-aires[m[m    [34mhawaii[m[m          [34mmallorca[m[m        [34mpuglia[m[m          [34mtokyo[m[m
[34mcape-town[m[m       [34mistanbul[m[m        [34mmelbourne[m[m       [34mrio-de-janeiro[m[m  [34mtoronto[m[m


In [10]:
# Check folders/files size
!du -h airbnb_data

9.5M	airbnb_data/barcelona
 49M	airbnb_data/paris
9.5M	airbnb_data/broward-county
 13M	airbnb_data/milan
8.5M	airbnb_data/austin
 14M	airbnb_data/melbourne
7.4M	airbnb_data/clark-county-nv
 23M	airbnb_data/puglia
9.2M	airbnb_data/mallorca
 11M	airbnb_data/tokyo
 19M	airbnb_data/new-york-city
 12M	airbnb_data/istanbul
9.9M	airbnb_data/sydney
 13M	airbnb_data/lisbon
 13M	airbnb_data/cape-town
 14M	airbnb_data/madrid
 19M	airbnb_data/rio-de-janeiro
 15M	airbnb_data/crete
 14M	airbnb_data/mexico-city
 11M	airbnb_data/girona
 11M	airbnb_data/toronto
 19M	airbnb_data/hawaii
 49M	airbnb_data/london
 27M	airbnb_data/los-angeles
 29M	airbnb_data/sicily
 19M	airbnb_data/rome
 17M	airbnb_data/south-aegean
 10M	airbnb_data/copenhagen
 12M	airbnb_data/bangkok
 19M	airbnb_data/buenos-aires
506M	airbnb_data


# Data import

## Spark

In [14]:
# Install pyspark
!pip install pyspark -qq

# Import package and setup local spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/01/23 17:22:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [16]:
# Read data - in spark we can define only regular expresion like in dir
# there is no need to refer to each file
listings_df_raw = spark.read.option('multiLine', 'True')\
    .option('escape', '"').option("mode", "DROPMALFORMED")\
    .csv('airbnb_data/*/*.gz', header=True, inferSchema = True)

                                                                                

In [18]:
num_rows = listings_df_raw.count()
print(f"Number of rows: {num_rows}")



Number of rows: 976524


                                                                                

In [20]:
# Check nuber of rows per input file
from pyspark.sql.functions import input_file_name
listings_df_raw.withColumn("input_file", input_file_name()).groupBy(['input_file']).count().toPandas()

                                                                                

Unnamed: 0,input_file,count
0,file:///Users/michalkoperski/Downloads/aaa/air...,95461
1,file:///Users/michalkoperski/Downloads/aaa/air...,96182
2,file:///Users/michalkoperski/Downloads/aaa/air...,45533
3,file:///Users/michalkoperski/Downloads/aaa/air...,60515
4,file:///Users/michalkoperski/Downloads/aaa/air...,35443
5,file:///Users/michalkoperski/Downloads/aaa/air...,48818
6,file:///Users/michalkoperski/Downloads/aaa/air...,35295
7,file:///Users/michalkoperski/Downloads/aaa/air...,37548
8,file:///Users/michalkoperski/Downloads/aaa/air...,34061
9,file:///Users/michalkoperski/Downloads/aaa/air...,36967


In [22]:
# Print list of columns
print(listings_df_raw.columns)

['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name', 'description', 'neighborhood_overview', 'picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude', 'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availabil

In [24]:
# Show top 10 rows of selected columns
listings_df_raw.select('neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed','has_availability','availability_30','availability_60', 'availability_90', 'availability_365','price').show(10)

+--------------------+----------------------+----------------------------+----------------+---------------+---------------+---------------+----------------+-------+
|       neighbourhood|neighbourhood_cleansed|neighbourhood_group_cleansed|has_availability|availability_30|availability_60|availability_90|availability_365|  price|
+--------------------+----------------------+----------------------------+----------------+---------------+---------------+---------------+----------------+-------+
|Neighborhood high...|          Observatoire|                        NULL|               t|              6|             20|             35|             297|$113.00|
|Neighborhood high...|        Hôtel-de-Ville|                        NULL|               t|              3|             24|             54|              77| $95.00|
|                NULL|        Hôtel-de-Ville|                        NULL|               t|              6|             30|             49|             316|$145.00|
|         

In [26]:
# Take first row form spark DataFrame
listings_df_raw.take(5)

25/01/23 17:23:59 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


[Row(id=3109, listing_url='https://www.airbnb.com/rooms/3109', scrape_id=20240906025355, last_scraped=datetime.date(2024, 9, 11), source='city scrape', name='zen and calm', description='Lovely Appartment with one bedroom with a Queen size bed.<br />Calm et bright<br />Sheets and towels furnished<br />4th floor, no elevator', neighborhood_overview='Good restaurants<br />very close the Montparnasse Station<br />15 m from the center of Paris', picture_url='https://a0.muscache.com/pictures/miso/Hosting-3109/original/50c69430-6385-413b-8a65-f6c022912301.jpeg', host_id=3631, host_url='https://www.airbnb.com/users/show/3631', host_name='Anne', host_since=datetime.date(2008, 10, 14), host_location='Paris, France', host_about=None, host_response_time='N/A', host_response_rate='N/A', host_acceptance_rate='67%', host_is_superhost='f', host_thumbnail_url='https://a0.muscache.com/im/users/3631/profile_pic/1375800198/original.jpg?aki_policy=profile_small', host_picture_url='https://a0.muscache.com/i

In [29]:
# show first 10
listings_df_raw.select('price').show(10)

+-------+
|  price|
+-------+
|$113.00|
| $95.00|
|$145.00|
|   NULL|
|$450.00|
|   NULL|
|   NULL|
|   NULL|
| $75.00|
|$246.00|
+-------+
only showing top 10 rows



In [31]:
listings_df = listings_df_raw.dropna(how='all', subset=['price'])

In [33]:
num_rows = listings_df.count()
print(f"Number of rows: {num_rows}")



Number of rows: 797480


                                                                                

In [35]:
listings_df.select('price').show(10)

+-------+
|  price|
+-------+
|$113.00|
| $95.00|
|$145.00|
|$450.00|
| $75.00|
|$246.00|
| $80.00|
| $75.00|
| $80.00|
|$124.00|
+-------+
only showing top 10 rows



In [37]:
# change from text to float
import pyspark.sql.functions as F
listings_df = listings_df.withColumn('price_new', F.regexp_replace('price', '\$', ''))
listings_df = listings_df.withColumn("price_new_float", F.col("price_new").cast("float"))

  listings_df = listings_df.withColumn('price_new', F.regexp_replace('price', '\$', ''))


In [39]:
# show top 10 data rows
listings_df.show(10)

+------+--------------------+--------------+------------+-----------+--------------------+--------------------+---------------------+--------------------+-------+--------------------+----------+----------+-------------+--------------------+------------------+------------------+--------------------+-----------------+--------------------+--------------------+--------------------+-------------------+-------------------------+--------------------+--------------------+----------------------+--------------------+----------------------+----------------------------+-----------------+-----------------+------------------+---------------+------------+---------+--------------+--------+----+--------------------+-------+--------------+--------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------------+----------------+----------------+---------------+---------------+---------------+----------------+-----------------

In [41]:
max_value = listings_df.agg(max(F.col("price_new_float"))).collect()[0][0]
print(f"Maximum value in the column: {max_value}")

PySparkTypeError: [NOT_ITERABLE] Column is not iterable.

# Data preparation

In [44]:
from pyspark.ml.feature import VectorAssembler

vectorAssembler = VectorAssembler(
    inputCols = ['accommodates',
                 'latitude',
                  'longitude',
                 'bedrooms',
                 'beds',
                 'number_of_reviews',
                 'review_scores_rating',
                 'reviews_per_month'],
    outputCol = 'features', handleInvalid='skip')

v_df = vectorAssembler.transform(listings_df)
v_df = v_df.select(['features', 'price_new_float'])

splits = v_df.randomSplit([0.7, 0.3])
train_df = splits[0].na.drop()
test_df = splits[1].na.drop()

In [46]:
train_df.show(10)

[Stage 16:>                                                         (0 + 1) / 1]

+--------------------+---------------+
|            features|price_new_float|
+--------------------+---------------+
|[1.0,48.81629,2.3...|           51.0|
|[1.0,48.81762,2.3...|           27.0|
|[1.0,48.82003,2.3...|           54.0|
|[1.0,48.820573431...|          120.0|
|[1.0,48.82062,2.3...|          100.0|
|[1.0,48.82093,2.3...|           39.0|
|[1.0,48.82201,2.3...|           42.0|
|[1.0,48.82212,2.3...|           65.0|
|[1.0,48.822366988...|           40.0|
|[1.0,48.822606921...|           60.0|
+--------------------+---------------+
only showing top 10 rows



                                                                                

# Modelling

## Linear model

### Model training

In [51]:
from pyspark.ml.regression import LinearRegression

# Load training data
lr = LinearRegression(featuresCol = 'features',
                      labelCol='price_new_float',
                      maxIter=10,
                      regParam=0.3,
                      elasticNetParam=0.8)

# Fit the model
lrModel = lr.fit(train_df)

25/01/23 17:24:58 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
                                                                                

### Model evaluation

In [54]:
# Print the coefficients and intercept for linear regression
print(lrModel)
def round_(x): return f'{x:.2f}'
print(f'Intercept: \n\t {round_(lrModel.intercept)}')
print(f'Coefficients:\n\t {list(map(round_, list(lrModel.coefficients)))}')

# Summarize the model over the training set and print out some metrics
trainingSummary = lrModel.summary
print(f'numIterations: \n\t {lrModel.summary.totalIterations}')
print(f'objectiveHistory: \n\t {list(map(round_, list(lrModel.summary.objectiveHistory))) }')
print(f'RMSE:\n\t {round_(lrModel.summary.rootMeanSquaredError)}')
print(f'r-squared:\n\t {round_(lrModel.summary.r2)}')

print('Residuals:\n ')
lrModel.summary.residuals.show()

LinearRegressionModel: uid=LinearRegression_f913ce8769e5, numFeatures=8
Intercept: 
	 102.75
Coefficients:
	 ['12.81', '-2.29', '-0.36', '34.83', '-6.25', '-0.03', '17.51', '-3.31']
numIterations: 
	 10
objectiveHistory: 
	 ['0.50', '0.48', '0.42', '0.42', '0.41', '0.41', '0.41', '0.41', '0.41', '0.41', '0.41']
RMSE:
	 168.99
r-squared:
	 0.18
Residuals:
 


[Stage 21:>                                                         (0 + 1) / 1]

+-------------------+
|          residuals|
+-------------------+
| -56.11222739372387|
|  -86.7168411511918|
|-22.155600988916092|
|  13.99053183295733|
|-14.057083881404793|
|  -66.3190352064519|
| -68.19806929123183|
|-40.871915253663815|
| -39.09146835365971|
| -53.78012609695503|
| -34.29246169034923|
|  47.88150000175908|
|  18.68202982392978|
|-30.311252185657878|
| -45.79313201802778|
| -52.50212651742444|
| -42.56890808907528|
|   -73.473808350751|
| -56.42721211423722|
| -5.756821288334038|
+-------------------+
only showing top 20 rows



                                                                                

In [55]:
from pyspark.ml.evaluation import RegressionEvaluator

print(lrModel)  # summary only

# Train predictions
train_predictions = lrModel.transform(train_df)
train_predictions.select("prediction", "price_new_float", "features").show(5)
train_evaluator = RegressionEvaluator(
    labelCol="price_new_float",
    predictionCol="prediction",
    metricName="rmse")
train_rmse = train_evaluator.evaluate(train_predictions)
print("Root Mean Squared Error (RMSE) on train data = %g" % train_rmse)

# Test predictions
test_predictions = lrModel.transform(test_df)

# Select example rows to display.
test_predictions.select("prediction", "price_new_float", "features").show(5)

# Select (prediction, true label) and compute test error
test_evaluator = RegressionEvaluator(
    labelCol="price_new_float",
    predictionCol="prediction",
    metricName="rmse")
test_rmse = test_evaluator.evaluate(test_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_rmse)

LinearRegressionModel: uid=LinearRegression_f913ce8769e5, numFeatures=8


                                                                                

+------------------+---------------+--------------------+
|        prediction|price_new_float|            features|
+------------------+---------------+--------------------+
|107.11222739372387|           51.0|[1.0,48.81629,2.3...|
| 113.7168411511918|           27.0|[1.0,48.81762,2.3...|
| 76.15560098891609|           54.0|[1.0,48.82003,2.3...|
|106.00946816704267|          120.0|[1.0,48.820573431...|
| 114.0570838814048|          100.0|[1.0,48.82062,2.3...|
+------------------+---------------+--------------------+
only showing top 5 rows



                                                                                

Root Mean Squared Error (RMSE) on train data = 168.99


                                                                                

+------------------+---------------+--------------------+
|        prediction|price_new_float|            features|
+------------------+---------------+--------------------+
|117.33253804176522|           45.0|[1.0,48.817996216...|
|  119.567004320525|           75.0|[1.0,48.81964,2.3...|
|116.05118707350368|           65.0|[1.0,48.82295,2.3...|
|116.77520078249765|           70.0|[1.0,48.8241,2.34...|
| 256.6481468903319|           27.0|[1.0,48.82482,2.3...|
+------------------+---------------+--------------------+
only showing top 5 rows





Root Mean Squared Error (RMSE) on test data = 168.925


                                                                                

## Random Forest

### Model training

In [62]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator

In [64]:
from pyspark.ml.regression import RandomForestRegressor

# Train a RandomForest model.
rf = RandomForestRegressor(featuresCol="features",
                           labelCol = 'price_new_float',
                           maxDepth = 5, numTrees = 100)

# Train model.  This also runs the indexer.
rfModel = rf.fit(train_df)

                                                                                

### Model evaluation

In [67]:
from pyspark.ml.evaluation import RegressionEvaluator

print(rfModel)  # summary only

# Train predictions
train_predictions = rfModel.transform(train_df)
train_predictions.select("prediction", "price_new_float", "features").show(5)
train_evaluator = RegressionEvaluator(
    labelCol="price_new_float",
    predictionCol="prediction",
    metricName="rmse")
train_rmse = train_evaluator.evaluate(train_predictions)
print("Root Mean Squared Error (RMSE) on train data = %g" % train_rmse)

# Test predictions
test_predictions = rfModel.transform(test_df)

# Select example rows to display.
test_predictions.select("prediction", "price_new_float", "features").show(5)

# Select (prediction, true label) and compute test error
test_evaluator = RegressionEvaluator(
    labelCol="price_new_float", predictionCol="prediction", metricName="rmse")
test_rmse = test_evaluator.evaluate(test_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_rmse)

RandomForestRegressionModel: uid=RandomForestRegressor_71f5415f4b43, numTrees=100, numFeatures=8


                                                                                

+------------------+---------------+--------------------+
|        prediction|price_new_float|            features|
+------------------+---------------+--------------------+
| 120.5328256477612|           51.0|[1.0,48.81629,2.3...|
|120.45383051801012|           27.0|[1.0,48.81762,2.3...|
| 119.9327775542346|           54.0|[1.0,48.82003,2.3...|
|118.88177022763035|          120.0|[1.0,48.820573431...|
|120.07214414022415|          100.0|[1.0,48.82062,2.3...|
+------------------+---------------+--------------------+
only showing top 5 rows



                                                                                

Root Mean Squared Error (RMSE) on train data = 138.507


                                                                                

+------------------+---------------+--------------------+
|        prediction|price_new_float|            features|
+------------------+---------------+--------------------+
|121.20472691185765|           45.0|[1.0,48.817996216...|
| 121.5036201423922|           75.0|[1.0,48.81964,2.3...|
|120.37525181089202|           65.0|[1.0,48.82295,2.3...|
|121.20472691185765|           70.0|[1.0,48.8241,2.34...|
| 245.3837618735119|           27.0|[1.0,48.82482,2.3...|
+------------------+---------------+--------------------+
only showing top 5 rows





Root Mean Squared Error (RMSE) on test data = 138.823


                                                                                

# EDA/Feature enineering


## City column

In [88]:
# # Add city column (based on file dir)
from pyspark.sql.functions import split, col

listings_df = listings_df\
    .withColumn("input_file", input_file_name())\
    .withColumn("city", split(col('input_file'), '/').getItem(5))

In [90]:
listings_df.select('input_file','city').show(10, truncate = False)

+----------------------------------------------------------------------------+---------+
|input_file                                                                  |city     |
+----------------------------------------------------------------------------+---------+
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michalkoperski/Downloads/aaa/airbnb_data/paris/listings.csv.gz|Downloads|
|file:///Users/michal