**Creating Housing DF**
StructType defines the schema of a DataFrame. It is a collection of StructField objects, each of which gives one column's name, data type, and nullability.
Passing an explicit schema when reading structured data fixes every column's type up front instead of leaving it to type inference.
pyspark.sql.types.StructType(fields=None)
pyspark.sql.types.StructField(name, dataType, nullable=True, metadata=None)

In [0]:
from pyspark.sql.types import DoubleType, StringType, StructType, StructField

# The nine numeric columns are read as doubles; ocean_proximity is categorical text.
schema = StructType([
    StructField("longitude", DoubleType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("housing_median_age", DoubleType(), True),
    StructField("total_rooms", DoubleType(), True),
    StructField("total_bedrooms", DoubleType(), True),
    StructField("population", DoubleType(), True),
    StructField("households", DoubleType(), True),
    StructField("median_income", DoubleType(), True),
    StructField("median_house_value", DoubleType(), True),
    StructField("ocean_proximity", StringType(), True),
])

# No header option is passed, so Spark treats the first line of the file as data.
# If housing.csv starts with a header row, add header=True to skip it.
housing_df = spark.read.csv("/FileStore/tables/housing.csv", schema=schema)

In [0]:
housing_df.display()

Exploratory Data Analysis (EDA)

In [0]:
housing_df.count()

In [0]:
housing_df.columns

In [0]:
type(housing_df)

In [0]:
train_df, test_df = housing_df.randomSplit([0.8, 0.2], seed=12345)
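
The weights passed to randomSplit are target proportions, not exact sizes: each row is assigned to a split at random, so the resulting counts only approximate 80/20. The sketch below illustrates that idea in plain Python. It is not Spark's implementation; `random_split` and the sample `rows` are hypothetical names used only for illustration.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to exactly one split by drawing a uniform number per row."""
    total = sum(weights)

    # Cumulative upper bounds of each split's slice of the interval [0, 1).
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift in the last bound

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits


# Hypothetical stand-in for DataFrame rows.
rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.8, 0.2], seed=12345)
print(len(train_rows), len(test_rows))
```

Fixing the seed makes the assignment repeatable for the same input, which is why the cell above passes seed=12345. Comparing train_df.count() and test_df.count() shows how close the actual split is to 80/20.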

Create AutoML Model

In [0]:
from databricks import automl

In [0]:
summary = automl.regress(train_df, target_col="median_house_value", timeout_minutes=20)

Test the model

Predict the median house value for the held-out 20% test split, using the best model found by AutoML.

In [0]:
import mlflow

model_uri = f"runs:/{summary.best_trial.mlflow_run_id}/model"

predict = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type="double")
pred_df = test_df.withColumn("prediction", predict(*test_df.drop("median_house_value").columns))
display(pred_df)

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

regressionEvaluator = RegressionEvaluator(predictionCol="prediction", labelCol="median_house_value", metricName="r2")
r2_score = regressionEvaluator.evaluate(pred_df)

print(f"R2 score on test data: {r2_score}")
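
To make the metric concrete, the r2 value is the coefficient of determination: 1 minus the ratio of the residual sum of squares to the total sum of squares around the label mean. The sketch below recomputes it in plain Python on a handful of made-up numbers. `r2_from_lists` and the sample values are hypothetical and are not taken from the housing data.

```python
def r2_from_lists(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    # Squared error of the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    # Squared error of a baseline that always predicts the label mean.
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


# Made-up labels and predictions, for illustration only.
sample_labels = [100.0, 200.0, 300.0, 400.0]
sample_predictions = [110.0, 190.0, 310.0, 390.0]
print(r2_from_lists(sample_labels, sample_predictions))
```

A score of 1.0 means the predictions match the labels exactly, 0.0 means the model does no better than always predicting the mean, and negative values mean it does worse than that baseline.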

