### Airbnb Dataset Analysis and Regression Modeling with Spark

The code below walks through a simplified data analysis and regression workflow on the Airbnb dataset using Apache Spark. The dataset is loaded from Parquet files and preprocessed for regression. Two models, linear regression and a random forest (`LinearRegression` and `RandomForestRegressor` in Spark ML), are then trained to predict the `price` column.

**Dataset Source and Copyright:**
The data for this analysis is sourced from [Inside Airbnb](http://insideairbnb.com/get-the-data.html). All copyrights and ownership of the data belong to the respective owner.

**Dataset and Preprocessing:**
The Airbnb dataset is loaded and its schema is printed. Columns of type `string` are treated as categorical, and columns of type `double` (other than `price`) as numeric. Numeric features are median-imputed, assembled into a vector, and standardized; categorical features are string-indexed and one-hot encoded, then combined with the scaled numeric vector into a single `features` column.
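
To make the preprocessing steps concrete, here is a minimal plain-Python sketch of what they compute on a toy table. It does not use Spark. The column names `bedrooms` and `room_type` and their values are illustrative assumptions, and the helpers `impute_median`, `standardize`, and `one_hot` are hypothetical. They are meant to mirror the behaviour of Spark's `Imputer` (median), `StandardScaler` (sample standard deviation), `StringIndexer` (most frequent label first), and `OneHotEncoder` (last category dropped) under their default settings.

```python
import math
import statistics

def impute_median(values):
    """Replace missing (None) entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def standardize(values):
    """Center to zero mean and scale to unit sample standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return [(v - mean) / std for v in values]

def one_hot(labels):
    """Index labels by descending frequency, then one-hot encode without the last index."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    # Most frequent label gets index 0; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    index = {label: i for i, label in enumerate(ordered)}
    width = len(ordered) - 1  # the last category is encoded as all zeros
    return [[1.0 if index[label] == i else 0.0 for i in range(width)] for label in labels]

# Toy data (illustrative values, not taken from the Airbnb dataset)
bedrooms = [1.0, None, 3.0, 2.0, 2.0]
room_type = ["Entire home", "Private room", "Entire home", "Shared room", "Entire home"]

scaled = standardize(impute_median(bedrooms))
encoded = one_hot(room_type)

# Same ordering as in the notebook: one-hot slots first, scaled numeric values last.
features = [ohe + [num] for ohe, num in zip(encoded, scaled)]
for row in features:
    print([round(x, 3) for x in row])
```

Each categorical column contributes several slots to the final vector, one per category except the last, which is why the assembled `features` vector is usually much wider than the number of input columns.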

**Regression Modeling:**
Both models are trained on an 80% training split of the preprocessed data. Predictions are made on the remaining 20% test split, and the root mean square error (RMSE) is computed to evaluate each model:

- Linear Regression RMSE: 220.6
- Random Forest RMSE: 207.7
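
For reference, RMSE is the square root of the mean squared difference between the true and predicted prices, so it is expressed in the same units as `price`. The following plain-Python sketch shows the calculation on made-up numbers. The `rmse` helper and the sample values are illustrative assumptions, not part of the notebook, which relies on Spark's `RegressionEvaluator` instead.

```python
import math

def rmse(labels, predictions):
    """Root mean square error between two equally long sequences of numbers."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up prices and predictions, for illustration only
prices = [100.0, 250.0, 80.0]
predicted = [110.0, 230.0, 95.0]

print(f"RMSE: {rmse(prices, predicted):.1f}")
```

Because the errors are squared before averaging, a few listings with very large price errors can dominate the metric.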

**MLflow Integration:**
MLflow is used to log the random forest's parameters, the fitted model, the RMSE of both models, and a CSV artifact containing feature importance scores. Everything is recorded within a single MLflow run for easy tracking and reproducibility.

**Hyperparameter Tuning with Grid Search:**
A simple grid search is demonstrated for the `RandomForestRegressor`. Nine combinations of `maxDepth` (5, 10, 15) and `minInstancesPerNode` (1, 5, 10) are compared using 3-fold cross-validation on the training set, with RMSE as the selection metric. The model with the best hyperparameters is then evaluated on the test set.

- Random Forest with Cross-Validation RMSE: 202.9
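
The cost of the search grows with the size of the grid. The sketch below, in plain Python rather than Spark, enumerates the combinations that the grid in this notebook produces and counts the model fits that 3-fold cross-validation implies. The `expand_grid` helper is a hypothetical stand-in for what `ParamGridBuilder.build()` returns, not its actual implementation.

```python
from itertools import product

# Grid values and fold count taken from the notebook's tuning code
grid = {
    "maxDepth": [5, 10, 15],
    "minInstancesPerNode": [1, 5, 10],
}
num_folds = 3

def expand_grid(grid):
    """Return every combination of the grid values as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

candidates = expand_grid(grid)
fits_during_search = len(candidates) * num_folds  # one fit per combination per fold

print(f"{len(candidates)} combinations, {fits_during_search} fits during the search")
for params in candidates[:3]:
    print(params)
```

After the search, Spark's `CrossValidator` refits the best combination once more on the full training set, so with 100 trees per forest even this small grid is noticeably more expensive than a single fit.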

**Best Random Forest Parameters:**
- bootstrap: True
- cacheNodeIds: False
- checkpointInterval: 10
- featureSubsetStrategy: auto
- featuresCol: features
- impurity: variance
- labelCol: price
- leafCol: (empty string)
- maxBins: 32
- maxDepth: 15
- maxMemoryInMB: 256
- minInfoGain: 0.0
- minInstancesPerNode: 5
- minWeightFractionPerNode: 0.0
- numTrees: 100
- predictionCol: prediction
- seed: 42
- subsamplingRate: 1.0

This example serves as a basic introduction to Spark and MLflow, providing a foundation for more advanced analyses and machine learning tasks.

In [None]:
# Imports
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler, Imputer, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression, RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import expr

# Initialize Spark session
spark = SparkSession.builder.appName("sparkAirbnb").getOrCreate()

# Load the dataset
datasetPath = "airbnb"
airbnbDF = spark.read.format("parquet").load(datasetPath)

# Display the schema
airbnbDF.printSchema()

# Identify categorical and numeric columns
categoricalCols = [field for (field, dataType) in airbnbDF.dtypes if dataType == "string"]
numericCols = [field for (field, dataType) in airbnbDF.dtypes if ((dataType == "double") and (field != "price"))]

# Numeric pipeline: median imputation -> vector assembly -> standardization
imputedCols = [f"{c}_imputed" for c in numericCols]
imputer_numeric = Imputer(strategy="median", inputCols=numericCols, outputCols=imputedCols)
assembler_numeric = VectorAssembler(inputCols=imputedCols, outputCol="numeric_features")
scaler = StandardScaler(inputCol="numeric_features", outputCol="scaled_features", withStd=True, withMean=True)
# Left unfitted here: it is fitted once as the first stage of combined_pipeline below
numeric_pipeline = Pipeline(stages=[imputer_numeric, assembler_numeric, scaler])

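# Categorical pipeline: string indexing -> one-hot encoding
indexOutputCols = [x + "Index" for x in categoricalCols]
oheOutputCols = [x + "OHE" for x in categoricalCols]
# handleInvalid="skip" drops rows whose categorical value is null or was not seen during fitting
stringIndexer = StringIndexer(inputCols=categoricalCols, outputCols=indexOutputCols, handleInvalid="skip")
oheEncoder = OneHotEncoder(inputCols=indexOutputCols, outputCols=oheOutputCols)

# Combine the one-hot vectors with the scaled numeric vector into a single "features" column
assemblerInputs = oheOutputCols + ["scaled_features"]
vecAssembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
combined_pipeline = Pipeline(stages=[numeric_pipeline, stringIndexer, oheEncoder, vecAssembler])
# Note: the pipeline is fitted on the full dataset, so the imputation medians, scaling
# statistics, and category indices also see rows that later land in the test set.
# Fitting on the training split only would avoid this leakage.
transformedDF = combined_pipeline.fit(airbnbDF).transform(airbnbDF)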

# Split into train and test sets
trainDF, testDF = transformedDF.randomSplit([0.8, 0.2], seed=42)

# Linear Regression
lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(trainDF)

# RandomForestRegressor
rf = RandomForestRegressor(featuresCol='features', labelCol='price', numTrees=100, seed=42)
rf_model = rf.fit(trainDF)

# Make predictions on the test set
predDF_lr = lr_model.transform(testDF)
predDF_rf = rf_model.transform(testDF)

# Display predictions (abs() only changes the displayed values; the RMSE below uses the raw predictions)
predDF_lr.select("features", "price", "prediction").withColumn("prediction", expr("abs(prediction)")).show(10)
predDF_rf.select("features", "price", "prediction").withColumn("prediction", expr("abs(prediction)")).show(10)

# Evaluate models
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse_lr = evaluator.evaluate(predDF_lr)
rmse_rf = evaluator.evaluate(predDF_rf)

print(f"Linear Regression RMSE: {rmse_lr:.1f}")
print(f"Random Forest RMSE: {rmse_rf:.1f}")

# MLflow logging
import os
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
import mlflow
import mlflow.spark
import pandas as pd

# Log parameters and metrics with MLflow
with mlflow.start_run(run_name="random-forest") as run:
    # Log parameters
    mlflow.log_param("model", "RandomForestRegressor")
    mlflow.log_params({param.name: value for param, value in rf_model.extractParamMap().items()})
    
    # Log model
    mlflow.spark.log_model(rf_model, "random_forest_model")

    # Log metrics
    evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
    rmse_lr = evaluator.evaluate(lr_model.transform(testDF))
    mlflow.log_metric("rmse_lr", rmse_lr)
    
    rmse_rf = evaluator.evaluate(rf_model.transform(testDF))
    mlflow.log_metric("rmse_rf", rmse_rf)

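    # Log artifact: feature importance scores.
    # featureImportances has one entry per slot of the assembled "features" vector
    # (each one-hot column expands into several slots), so slot names are read from
    # the column metadata instead of vecAssembler.getInputCols().
    attrs = transformedDF.schema["features"].metadata.get("ml_attr", {}).get("attrs", {})
    slotNames = {attr["idx"]: attr["name"] for group in attrs.values() for attr in group}
    importances = rf_model.featureImportances.toArray()
    pandasDF = (pd.DataFrame({"feature": [slotNames.get(i, f"features_{i}") for i in range(len(importances))],
                              "importance": importances})
                .sort_values(by="importance", ascending=False))

    os.makedirs("data", exist_ok=True)  # make sure the output directory exists
    pandasDF.to_csv("data/feature-importance.csv", index=False)
    mlflow.log_artifact("data/feature-importance.csv")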

# Simple hyperparameter tuning (ParamGridBuilder and CrossValidator are imported above)

# Define the parameter grid for Random Forest
param_grid_rf = (
    ParamGridBuilder() 
    .addGrid(rf.maxDepth, [5, 10, 15]) 
    .addGrid(rf.minInstancesPerNode, [1, 5, 10]) 
    .build()
)

# Cross-validation with the evaluator
crossval_rf = CrossValidator(estimator=rf,
                             estimatorParamMaps=param_grid_rf,
                             evaluator=evaluator,
                             numFolds=3)

# Fit the Random Forest model with cross-validation on the training data
cv_model_rf = crossval_rf.fit(trainDF)
# Make predictions on the test set using the best Random Forest model
predDF_rf_cv = cv_model_rf.transform(testDF)

predDF_rf_cv.select("features", "price", "prediction").withColumn("prediction", expr("abs(prediction)")).show(10)
rmse_rf_cv = evaluator.evaluate(predDF_rf_cv)

print(f"Random Forest with Cross-Validation RMSE: {rmse_rf_cv:.1f}")

best_params_rf = cv_model_rf.bestModel.extractParamMap()
print("Best Random Forest Parameters:")
for param, value in best_params_rf.items():
    print(f"{param.name}: {value}")
