
## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html), the Databricks File System, lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. However, you can use a different language in a cell with the `%LANGUAGE` syntax. Python, Scala, SQL, and R are all supported.

In [0]:
# File location and type
file_location = "/FileStore/tables/tips.csv"
file_type = "csv"

# header=True uses the first row as column names; inferSchema=True lets Spark detect the column types.
df = spark.read.csv(file_location, inferSchema=True, header=True)
df.show()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|
|     18.43| 3.0|  Male|    No|Sun|Dinner|   4|
|     14.83|3.02|Female|    No|Sun|Dinner|   2|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     10.33|1.67|Female|    No|Sun|Dinner|   3|
|     16.29|3.71|  Male|    No|Sun|Dinne

In [0]:
# total_bill is the dependent variable (label); the remaining columns are the independent features
df.columns

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [0]:
# Convert the categorical (string) columns into numeric indices
from pyspark.ml.feature import StringIndexer

# inputCols/outputCols (several columns at once) require Spark 3.0 or later
indexer = StringIndexer(
    inputCols=["sex", "smoker", "day", "time"],
    outputCols=["sex_indexed", "smoker_indexed", "day_indexed", "time_index"],
)
df_r = indexer.fit(df).transform(df)
df_r.show()

+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_index|
+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|           0.0|        1.0|       0.0|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|       0.0|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|           0.0|        1.0|       0.0|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        0.0|           0.0|        1.0|       0.0|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|       0.0|
|     26.88|3.12|  M
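
The index values in the table above follow from how `StringIndexer` orders labels: with its default `stringOrderType` of `frequencyDesc`, the most frequent label in a column receives index 0.0, the next most frequent 1.0, and so on. The following is a minimal plain-Python sketch of that rule, not Spark's implementation. The helper `frequency_index` is hypothetical, and the sample is only the first eight `sex` values from the `df.show()` output, used here for illustration.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Hypothetical helper: it mimics frequency-descending ordering and
    breaks ties alphabetically.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# First eight values of the `sex` column from the df.show() output
sex_sample = ["Female", "Male", "Male", "Male", "Female", "Male", "Male", "Male"]

sex_mapping = frequency_index(sex_sample)
sex_indexed = [sex_mapping[v] for v in sex_sample]

print(sex_mapping)
print(sex_indexed)
```

In this sample `Male` is the more frequent label, so it maps to 0.0 and `Female` to 1.0, in line with the `sex_indexed` column above. Note that these indices impose an arbitrary ordering on the categories, which a linear model will treat as numeric.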

In [0]:
# VectorAssembler combines multiple feature columns into a single vector column, the input format Spark ML models expect
from pyspark.ml.feature import VectorAssembler

featureassembler = VectorAssembler(
    inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_index'],
    outputCol="Independent Features",
)
output = featureassembler.transform(df_r)
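
To make the assembling step concrete, here is a small plain-Python sketch of what happens to a single row: the listed columns are read in order and concatenated into one numeric vector. The `assemble` function is a hypothetical stand-in for illustration, not the Spark API, and the row is the first row of `df_r` as shown in the `StringIndexer` output.

```python
def assemble(row, input_cols):
    """Concatenate the named numeric columns of one row into a single feature list."""
    return [float(row[col]) for col in input_cols]

input_cols = ["tip", "size", "sex_indexed", "smoker_indexed", "day_indexed", "time_index"]

# First row of df_r (string columns omitted, since they are not assembled)
first_row = {
    "total_bill": 16.99,
    "tip": 1.01,
    "size": 2,
    "sex_indexed": 1.0,
    "smoker_indexed": 0.0,
    "day_indexed": 1.0,
    "time_index": 0.0,
}

features = assemble(first_row, input_cols)
print(features)
```

The label column `total_bill` is deliberately left out of `input_cols`, so it does not leak into the feature vector.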

In [0]:
output.show()

+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+--------------------+
|total_bill| tip|   sex|smoker|day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_index|Independent Features|
+----------+----+------+------+---+------+----+-----------+--------------+-----------+----------+--------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|        1.0|           0.0|        1.0|       0.0|[1.01,2.0,1.0,0.0...|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|[1.66,3.0,0.0,0.0...|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        0.0|           0.0|        1.0|       0.0|[3.5,3.0,0.0,0.0,...|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        0.0|           0.0|        1.0|       0.0|[3.31,2.0,0.0,0.0...|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        1.0|           0.0|        1.0|       0.0|[3.61,4.0,1.0,0.0...|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4| 

In [0]:
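from pyspark.ml.regression import LinearRegression

# Train/test split (80/20). No seed is set, so the split differs between runs.
train, test = output.randomSplit([0.8, 0.2])
regressor = LinearRegression(featuresCol='Independent Features', labelCol='total_bill')
# fit() returns a fitted LinearRegressionModel; the name `regressor` is reused for it
regressor = regressor.fit(train)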
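
`randomSplit` assigns each row to a split at random according to the weights, so the 80/20 proportions are only approximate, and without a `seed` argument the split changes from run to run. The sketch below illustrates that idea in plain Python. It is a rough analogy rather than Spark's implementation: `random_split` is a hypothetical helper, the 200 integer row ids are stand-in data, and the seed value 42 is arbitrary.

```python
import random

def random_split(rows, train_weight=0.8, seed=None):
    """Assign each row independently to train or test with the given probability."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(200))  # stand-in row ids, not the real dataset

# The same seed reproduces the same split
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)
```

In the notebook, passing a seed, for example `output.randomSplit([0.8, 0.2], seed=42)`, should make the split repeatable for the same input data, although the fitted coefficients and metrics would then differ from the ones shown here.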

In [0]:
print(regressor.coefficients)
print(regressor.intercept)

[3.2719537723109173,3.398008983650757,-1.178104353168574,2.7983941510858767,-0.08135242126223019,-1.0142044599371298]
0.9207151820882666
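
For a linear regression model, each prediction is the intercept plus the dot product of the coefficients and the feature vector. As a sanity check, the sketch below recomputes one prediction in plain Python from the values printed above. The feature vector is an assumption reconstructed from the first row of the predictions table later in this notebook (tip 5.15, size 2, and indexed values 0.0, 1.0, 1.0, 0.0).

```python
# Coefficients and intercept as printed by the fitted model
coefficients = [
    3.2719537723109173,    # tip
    3.398008983650757,     # size
    -1.178104353168574,    # sex_indexed
    2.7983941510858767,    # smoker_indexed
    -0.08135242126223019,  # day_indexed
    -1.0142044599371298,   # time_index
]
intercept = 0.9207151820882666

# Feature vector of one test row: [tip, size, sex, smoker, day, time]
features = [5.15, 2.0, 0.0, 1.0, 1.0, 0.0]

# prediction = intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
print(prediction)
```

The result is about 27.28, which agrees with the `prediction` column for that row. The order of the coefficients matches the order of `inputCols` given to the `VectorAssembler`.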


In [0]:
# Evaluate the model on the test set; the returned summary holds the predictions and the error metrics
pred_results = regressor.evaluate(test)

In [0]:
pred_results.predictions.show()

+----------+----+------+------+----+------+----+-----------+--------------+-----------+----------+--------------------+------------------+
|total_bill| tip|   sex|smoker| day|  time|size|sex_indexed|smoker_indexed|day_indexed|time_index|Independent Features|        prediction|
+----------+----+------+------+----+------+----+-----------+--------------+-----------+----------+--------------------+------------------+
|      7.25|5.15|  Male|   Yes| Sun|Dinner|   2|        0.0|           1.0|        1.0|       0.0|[5.15,2.0,0.0,1.0...| 27.28433680661465|
|      7.56|1.44|  Male|    No|Thur| Lunch|   2|        0.0|           0.0|        2.0|       1.0|[1.44,2.0,0.0,0.0...|11.251437279055914|
|      8.58|1.92|  Male|   Yes| Fri| Lunch|   1|        0.0|           1.0|        3.0|       1.0|[1.92,1.0,0.0,1.0...| 12.14100783593804|
|       9.6| 4.0|Female|   Yes| Sun|Dinner|   2|        1.0|           1.0|        1.0|       0.0|[4.0,2.0,1.0,1.0,...| 22.34348561528852|
|     10.59|1.61|Female|   

In [0]:
# R-squared, mean absolute error, and mean squared error on the test set
pred_results.r2, pred_results.meanAbsoluteError, pred_results.meanSquaredError

(0.3439014109078985, 5.118904083727258, 49.20304364651471)
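
An R-squared of about 0.34 means the model explains roughly a third of the variance of `total_bill` on the test set. To show how the three metrics are defined, here is a minimal plain-Python sketch. The function `regression_metrics` is a hypothetical helper, and the sample consists only of the four complete rows visible in the predictions table, so its values will not match the full test-set metrics above.

```python
import math

def regression_metrics(labels, predictions):
    """Return (r2, mean absolute error, mean squared error) for paired values."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    ss_err = mse * n
    r2 = 1.0 - ss_err / ss_tot
    return r2, mae, mse

# The four complete rows visible in the predictions table (not the full test set)
labels = [7.25, 7.56, 8.58, 9.6]
predictions = [27.28433680661465, 11.251437279055914, 12.14100783593804, 22.34348561528852]

r2, mae, mse = regression_metrics(labels, predictions)
print(r2, mae, mse)

# The square root of the reported test-set MSE gives the RMSE in the units of total_bill
rmse_reported = math.sqrt(49.20304364651471)
print(rmse_reported)
```

The reported mean squared error corresponds to a root mean squared error of about 7.01, which is easier to read because it is in the same units as `total_bill`.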