## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) is the Databricks File System, which lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch a cell to a different language with the `%LANGUAGE` syntax; Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/ConcreteStrengthData-1.csv"
file_type = "csv"


# Read the CSV, treating the first row as a header and letting Spark infer column types
df = spark.read.csv(file_location, header=True, inferSchema=True)

In [0]:
df.printSchema()

In [0]:
df.show()

In [0]:
df.columns
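
The feature list used later in this notebook spells one column as `'CementComponent '`, with a trailing space. Column names must match exactly, so it can help to check for stray whitespace before building the feature vector. The sketch below assumes nothing about Spark: `find_padded_names` is a hypothetical helper (not part of PySpark) that works on any list of strings, and the example list is illustrative rather than the real output of `df.columns`.

```python
def find_padded_names(names):
    """Return the names that carry leading or trailing whitespace."""
    return [name for name in names if name != name.strip()]


# Illustrative list only; in the notebook you would pass df.columns instead.
example_columns = ["CementComponent ", "BlastFurnaceSlag", "Strength"]
padded = find_padded_names(example_columns)
print(padded)
```

Any name this reports has to be written with the same whitespace wherever it is referenced, unless the column is renamed first.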

In [0]:
from pyspark.ml.feature import StringIndexer

In [0]:
from pyspark.ml.feature import VectorAssembler

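# Combine the predictor columns into one vector column.
# 'CementComponent ' carries a trailing space and must match the DataFrame's column name exactly.
featureassembler = VectorAssembler(
    inputCols=['CementComponent ', 'BlastFurnaceSlag', 'FlyAshComponent', 'WaterComponent',
               'SuperplasticizerComponent', 'CoarseAggregateComponent', 'FineAggregateComponent', 'AgeInDays'],
    outputCol="Independent Features")
output = featureassembler.transform(df)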


In [0]:
output.select('Independent Features').show()

In [0]:
output.show()

In [0]:
# Keep only the feature vector and the target column
finalized_data = output.select("Independent Features", "Strength")

In [0]:
finalized_data.show()

In [0]:

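from pyspark.ml.regression import DecisionTreeRegressor, DecisionTreeRegressionModel

# Train/test split (75% / 25%); no seed is set, so the split differs between runs
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])

# fit() returns the trained model, which replaces the estimator held in `regressor`
regressor = DecisionTreeRegressor(featuresCol='Independent Features', labelCol='Strength')
regressor = regressor.fit(train_data)

# Predict on the held-out test data
dt_predictions = regressor.transform(test_data)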

In [0]:
dt_predictions.select("prediction","Strength","Independent Features").show(5)

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

dt_evaluator = RegressionEvaluator(labelCol="Strength", predictionCol="prediction", metricName="rmse")
rmse = dt_evaluator.evaluate(dt_predictions)
print("Root Mean Squared Error (RMSE) on test data = %g" % rmse)
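
For reference, the `rmse` metric is the square root of the mean of the squared differences between each label and its prediction. The following plain-Python sketch reproduces that formula on a few made-up numbers; `root_mean_squared_error` is a hypothetical helper for illustration, and the values are not results from this dataset.

```python
import math


def root_mean_squared_error(labels, predictions):
    """Square root of the mean squared difference between labels and predictions."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        raise ValueError("at least one label/prediction pair is required")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values: the errors are 1, 0 and -2, so the mean squared error is 5/3.
example_labels = [3.0, 5.0, 7.0]
example_predictions = [2.0, 5.0, 9.0]
example_rmse = root_mean_squared_error(example_labels, example_predictions)
print("RMSE = %g" % example_rmse)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones, and the result is in the same units as `Strength`.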

In [0]:
regressor.featureImportances
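
`featureImportances` is a vector of scores without column names. Assuming its entries follow the order of the assembler's `inputCols`, which should hold when every input is a plain numeric column, each score can be paired with its name and sorted. The sketch below uses a hypothetical helper, `rank_features`, with made-up scores so it runs without Spark.

```python
def rank_features(names, importances):
    """Pair each feature name with its score and sort from most to least important."""
    pairs = list(zip(names, (float(value) for value in importances)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up scores for illustration; they are not results from this model.
example_names = ["WaterComponent", "AgeInDays", "FlyAshComponent"]
example_importances = [0.2, 0.7, 0.1]
ranked = rank_features(example_names, example_importances)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

In the notebook, the same helper could be called as `rank_features(featureassembler.getInputCols(), regressor.featureImportances.toArray())`, using the assembler and fitted model from the earlier cells.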