## Overview

This notebook shows how to create and query a table or DataFrame from a file that you uploaded to DBFS. [DBFS](https://docs.databricks.com/user-guide/dbfs-databricks-file-system.html) (the Databricks File System) lets you store data for querying inside Databricks. The notebook assumes that the file you want to read is already in DBFS.

This notebook is written in **Python**, so the default cell type is Python. You can switch languages in an individual cell with the `%LANGUAGE` syntax; Python, Scala, SQL, and R are all supported.

In [0]:
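# File location and type
file_location = "/FileStore/tables/weatherHistory.csv"
file_type = "csv"

# Read the file, treating the first row as column names and inferring column types
mainDf = (spark.read.format(file_type)
          .option("header", True)
          .option("inferSchema", True)
          .load(file_location))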

In [0]:
df = mainDf
df.show()

+-------------------+-------------+-----------+------------------+------------------------+--------+------------------+----------------------+------------------+----------+--------------------+--------------------+
|     Formatted Date|      Summary|Precip Type|   Temperature (C)|Apparent Temperature (C)|Humidity| Wind Speed (km/h)|Wind Bearing (degrees)|   Visibility (km)|Loud Cover|Pressure (millibars)|       Daily Summary|
+-------------------+-------------+-----------+------------------+------------------------+--------+------------------+----------------------+------------------+----------+--------------------+--------------------+
|2006-03-31 22:00:00|Partly Cloudy|       rain| 9.472222222222221|      7.3888888888888875|    0.89|           14.1197|                 251.0|15.826300000000002|       0.0|             1015.13|Partly cloudy thr...|
|2006-03-31 23:00:00|Partly Cloudy|       rain| 9.355555555555558|       7.227777777777776|    0.86|           14.2646|                 259.

In [0]:
df.printSchema()

root
 |-- Formatted Date: timestamp (nullable = true)
 |-- Summary: string (nullable = true)
 |-- Precip Type: string (nullable = true)
 |-- Temperature (C): double (nullable = true)
 |-- Apparent Temperature (C): double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Wind Speed (km/h): double (nullable = true)
 |-- Wind Bearing (degrees): double (nullable = true)
 |-- Visibility (km): double (nullable = true)
 |-- Loud Cover: double (nullable = true)
 |-- Pressure (millibars): double (nullable = true)
 |-- Daily Summary: string (nullable = true)



In [0]:
df.columns

Out[182]: ['Formatted Date',
 'Summary',
 'Precip Type',
 'Temperature (C)',
 'Apparent Temperature (C)',
 'Humidity',
 'Wind Speed (km/h)',
 'Wind Bearing (degrees)',
 'Visibility (km)',
 'Loud Cover',
 'Pressure (millibars)',
 'Daily Summary']

In [0]:
# Drop the columns that are not used as features
df = df.drop("Formatted Date", "Loud Cover", "Daily Summary")

In [0]:
df.show()

+-------------+-----------+------------------+------------------------+--------+------------------+----------------------+------------------+--------------------+
|      Summary|Precip Type|   Temperature (C)|Apparent Temperature (C)|Humidity| Wind Speed (km/h)|Wind Bearing (degrees)|   Visibility (km)|Pressure (millibars)|
+-------------+-----------+------------------+------------------------+--------+------------------+----------------------+------------------+--------------------+
|Partly Cloudy|       rain| 9.472222222222221|      7.3888888888888875|    0.89|           14.1197|                 251.0|15.826300000000002|             1015.13|
|Partly Cloudy|       rain| 9.355555555555558|       7.227777777777776|    0.86|           14.2646|                 259.0|15.826300000000002|             1015.63|
|Mostly Cloudy|       rain| 9.377777777777778|       9.377777777777778|    0.89|3.9284000000000003|                 204.0|           14.9569|             1015.94|
|Partly Cloudy|       

In [0]:
df.count()

Out[185]: 96453

In [0]:
df.printSchema()

root
 |-- Summary: string (nullable = true)
 |-- Precip Type: string (nullable = true)
 |-- Temperature (C): double (nullable = true)
 |-- Apparent Temperature (C): double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Wind Speed (km/h): double (nullable = true)
 |-- Wind Bearing (degrees): double (nullable = true)
 |-- Visibility (km): double (nullable = true)
 |-- Pressure (millibars): double (nullable = true)



In [0]:
df.columns

Out[187]: ['Summary',
 'Precip Type',
 'Temperature (C)',
 'Apparent Temperature (C)',
 'Humidity',
 'Wind Speed (km/h)',
 'Wind Bearing (degrees)',
 'Visibility (km)',
 'Pressure (millibars)']

In [0]:
# Missing values in the numeric feature columns will be replaced by the column mean
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=[
        'Apparent Temperature (C)',
        'Humidity',
        'Wind Speed (km/h)',
        'Wind Bearing (degrees)',
        'Visibility (km)',
        'Pressure (millibars)',
    ],
    outputCols=[
        'Apparent Temperature (C)_imputed',
        'Humidity_imputed',
        'Wind Speed (km/h)_imputed',
        'Wind Bearing (degrees)_imputed',
        'Visibility (km)_imputed',
        'Pressure (millibars)_imputed',
    ],
).setStrategy("mean")

In [0]:
# Fit the imputer and append the imputed columns to df
df = imputer.fit(df).transform(df)
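
To make the "mean" strategy concrete, here is a minimal pure-Python sketch of the idea, not Spark's implementation: the mean is computed from the non-missing entries only, and each missing entry is replaced by it. The function name `impute_mean` and the sample readings are hypothetical.

```python
from statistics import mean


def impute_mean(values):
    """Return a copy of values with None replaced by the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)  # the mean ignores the missing entries
    return [fill if v is None else v for v in values]


# Hypothetical humidity readings with one gap (not taken from the dataset)
humidity = [0.89, None, 0.86, 0.83]
humidity_imputed = impute_mean(humidity)
print(humidity_imputed)
```

As in the cell above, the original column is left untouched and the filled values go into a separate `_imputed` column.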

In [0]:
df.printSchema()

root
 |-- Summary: string (nullable = true)
 |-- Precip Type: string (nullable = true)
 |-- Temperature (C): double (nullable = true)
 |-- Apparent Temperature (C): double (nullable = true)
 |-- Humidity: double (nullable = true)
 |-- Wind Speed (km/h): double (nullable = true)
 |-- Wind Bearing (degrees): double (nullable = true)
 |-- Visibility (km): double (nullable = true)
 |-- Pressure (millibars): double (nullable = true)
 |-- Apparent Temperature (C)_imputed: double (nullable = true)
 |-- Humidity_imputed: double (nullable = true)
 |-- Wind Speed (km/h)_imputed: double (nullable = true)
 |-- Wind Bearing (degrees)_imputed: double (nullable = true)
 |-- Visibility (km)_imputed: double (nullable = true)
 |-- Pressure (millibars)_imputed: double (nullable = true)



In [0]:
# Drop only the rows in which every column is null
df = df.na.drop(how="all")

In [0]:
df.count()

Out[192]: 96453
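
The row count is unchanged at 96453, which is consistent with `how="all"` removing a row only when every value in it is null. The sketch below illustrates the difference between `"all"` and `"any"` in plain Python; `drop_rows` and the sample rows are hypothetical stand-ins, not the Spark API.

```python
def drop_rows(rows, how="any"):
    """Drop rows with missing values.

    how="any": drop a row if at least one value is None.
    how="all": drop a row only if every value is None.
    """
    if how == "any":
        return [r for r in rows if all(v is not None for v in r)]
    if how == "all":
        return [r for r in rows if any(v is not None for v in r)]
    raise ValueError("how must be 'any' or 'all'")


# Hypothetical rows: one complete, one partially missing, one fully missing
rows = [("rain", 9.47), ("snow", None), (None, None)]
kept_all = drop_rows(rows, how="all")  # keeps the first two rows
kept_any = drop_rows(rows, how="any")  # keeps only the complete row
print(len(kept_all), len(kept_any))
```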

In [0]:
from pyspark.ml.feature import (VectorAssembler, OneHotEncoder,
                                StringIndexer)

In [0]:
Summary_indexer = StringIndexer(inputCol='Summary', outputCol='SummaryIndex')
df = Summary_indexer.fit(df).transform(df)
Summary_encoder = OneHotEncoder(inputCol='SummaryIndex', outputCol='SummaryVec')
df = Summary_encoder.fit(df).transform(df)

In [0]:
Precip_Type_indexer = StringIndexer(inputCol='Precip Type', outputCol='Precip_Type_Index')
df = Precip_Type_indexer.fit(df).transform(df)
Precip_Type_encoder = OneHotEncoder(inputCol='Precip_Type_Index', outputCol='Precip_Type_Vec')
df = Precip_Type_encoder.fit(df).transform(df)
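
The two-step encoding above can be illustrated without Spark. This sketch assumes the default settings of `StringIndexer` (labels ordered by descending frequency) and `OneHotEncoder` (last category dropped, so it is encoded as all zeros); the helper names and the sample labels are hypothetical.

```python
from collections import Counter


def string_index(values):
    """Map each label to an index; the most frequent label gets index 0."""
    counts = Counter(values)
    # Sort by descending frequency, breaking ties alphabetically
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: i for i, label in enumerate(ordered)}


def one_hot_drop_last(index, num_categories):
    """One-hot encode an index into num_categories - 1 slots; the last category is all zeros."""
    vec = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vec[index] = 1.0
    return vec


# Hypothetical sample of precipitation labels
precip = ["rain", "rain", "snow", "rain", "hail", "snow"]
mapping = string_index(precip)
encoded = [one_hot_drop_last(mapping[p], len(mapping)) for p in precip]
print(mapping)
print(encoded[0], encoded[2], encoded[4])
```

Dropping the last category means a column with k distinct labels contributes k - 1 entries to the feature vector.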

In [0]:
df.columns

Out[196]: ['Summary',
 'Precip Type',
 'Temperature (C)',
 'Apparent Temperature (C)',
 'Humidity',
 'Wind Speed (km/h)',
 'Wind Bearing (degrees)',
 'Visibility (km)',
 'Pressure (millibars)',
 'Apparent Temperature (C)_imputed',
 'Humidity_imputed',
 'Wind Speed (km/h)_imputed',
 'Wind Bearing (degrees)_imputed',
 'Visibility (km)_imputed',
 'Pressure (millibars)_imputed',
 'SummaryIndex',
 'SummaryVec',
 'Precip_Type_Index',
 'Precip_Type_Vec']

In [0]:
# Combine the imputed numeric columns and the one-hot vectors into a single feature vector
featureassembler = VectorAssembler(
    inputCols=[
        'Apparent Temperature (C)_imputed',
        'Humidity_imputed',
        'Wind Speed (km/h)_imputed',
        'Wind Bearing (degrees)_imputed',
        'Visibility (km)_imputed',
        'Pressure (millibars)_imputed',
        'SummaryVec',
        'Precip_Type_Vec',
    ],
    outputCol='features',
)

In [0]:
output=featureassembler.transform(df)

In [0]:
finalized_output=output.select("features","Temperature (C)")

In [0]:
finalized_output.show()

+--------------------+------------------+
|            features|   Temperature (C)|
+--------------------+------------------+
|(34,[0,1,2,3,4,5,...| 9.472222222222221|
|(34,[0,1,2,3,4,5,...| 9.355555555555558|
|(34,[0,1,2,3,4,5,...| 9.377777777777778|
|(34,[0,1,2,3,4,5,...|  8.28888888888889|
|(34,[0,1,2,3,4,5,...| 8.755555555555553|
|(34,[0,1,2,3,4,5,...| 9.222222222222221|
|(34,[0,1,2,3,4,5,...| 7.733333333333334|
|(34,[0,1,2,3,4,5,...|  8.77222222222222|
|(34,[0,1,2,3,4,5,...| 10.82222222222222|
|(34,[0,1,2,3,4,5,...| 13.77222222222222|
|(34,[0,1,2,3,4,5,...|16.016666666666666|
|(34,[0,1,2,3,4,5,...|17.144444444444446|
|(34,[0,1,2,3,4,5,...|17.800000000000004|
|(34,[0,1,2,3,4,5,...|17.333333333333332|
|(34,[0,1,2,3,4,5,...| 18.87777777777778|
|(34,[0,1,2,3,4,5,...|18.911111111111115|
|(34,[0,1,2,3,4,5,...| 15.38888888888889|
|(34,[0,1,2,3,4,5,...|15.550000000000002|
|(34,[0,1,2,3,4,5,...|14.255555555555553|
|(34,[0,1,2,3,4,5,...|13.144444444444442|
+--------------------+------------
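
Each `features` value above is printed in sparse form: the vector size (34), then the indices of the non-zero entries, then their values. The first six slots hold the imputed numeric columns and the remaining 28 come from the two one-hot vectors. The following sketch mimics the concatenation and the sparse display in plain Python; `assemble`, `to_sparse`, and the sample numbers are hypothetical and much smaller than the real 34-entry vectors.

```python
def assemble(*parts):
    """Concatenate scalars and vectors into one flat feature list, in argument order."""
    flat = []
    for part in parts:
        if isinstance(part, (list, tuple)):
            flat.extend(float(x) for x in part)
        else:
            flat.append(float(part))
    return flat


def to_sparse(dense):
    """Represent a dense list as (size, indices of non-zero entries, non-zero values)."""
    indices = [i for i, x in enumerate(dense) if x != 0.0]
    values = [dense[i] for i in indices]
    return (len(dense), indices, values)


# Two numeric features followed by two small one-hot vectors
row = assemble(7.39, 0.89, [0.0, 1.0, 0.0], [1.0, 0.0])
sparse_row = to_sparse(row)
print(sparse_row)
```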

In [0]:
# Split the data into training and test sets, then fit a linear regression
from pyspark.ml.regression import LinearRegression

train_data, test_data = finalized_output.randomSplit([0.75, 0.25])
regressor = LinearRegression(featuresCol='features', labelCol='Temperature (C)')
# fit() returns the fitted model; the name regressor is reused for it
regressor = regressor.fit(train_data)
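
Because no seed is passed to `randomSplit`, the split (and therefore the numbers reported in the cells that follow) will differ from run to run; `randomSplit` also accepts a `seed` argument for reproducibility. The weights are target proportions rather than exact sizes, since each row is assigned at random. The sketch below shows that idea in plain Python under those assumptions; `random_split` is a hypothetical helper, not Spark's algorithm.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row independently to one split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of the normalized weights, e.g. [0.75, 1.0]
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall in the bounds
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.75, 0.25], seed=42)
print(len(train), len(test))  # roughly 750 and 250, not exactly
```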

In [0]:
# Fitted coefficients, one per entry of the feature vector
regressor.coefficients

Out[202]: DenseVector([0.8734, -1.3554, 0.0858, -0.0004, 0.0011, -0.0002, 1.5176, 1.4001, 1.4789, 1.4837, 1.2275, 1.3727, 0.3933, 0.5863, 2.1516, 0.2259, 1.4207, 1.9709, 1.0441, 1.8947, 1.2436, 3.1701, -0.5444, 2.4261, 1.9003, 2.317, 1.4273, 1.0051, 2.0553, -1.6904, -0.9261, 0.0585, 0.4011, 0.4779])

In [0]:
# Fitted intercept
regressor.intercept

Out[203]: 0.8877719119658549
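
The coefficients and the intercept define the prediction for a row: the intercept plus the dot product of the coefficient vector with that row's feature vector. A minimal sketch with hypothetical two-feature numbers (the fitted model above has 34 coefficients):

```python
def predict(coefficients, intercept, features):
    """Linear regression prediction: intercept + sum of coefficient * feature."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Hypothetical values chosen for easy arithmetic, not the fitted ones
coefficients = [0.5, -2.0]
intercept = 1.0
features = [4.0, 0.25]
prediction = predict(coefficients, intercept, features)
print(prediction)  # 1.0 + 0.5 * 4.0 - 2.0 * 0.25 = 2.5
```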

In [0]:
# Evaluate the fitted model on the held-out test set
pred_results=regressor.evaluate(test_data)

In [0]:
pred_results.predictions.show()

+--------------------+--------------------+--------------------+
|            features|     Temperature (C)|          prediction|
+--------------------+--------------------+--------------------+
|(34,[0,1,2,3,4,5,...|  0.1388888888888889| -1.4683616556182477|
|(34,[0,1,2,3,4,5,...| 0.07222222222222364|  -1.365498164361702|
|(34,[0,1,2,3,4,5,...|    0.68888888888889| -0.8047694705318902|
|(34,[0,1,2,3,4,5,...| 0.11666666666666714| -1.2059724909357181|
|(34,[0,1,2,3,4,5,...| 0.07222222222222364| -1.0181188651021262|
|(34,[0,1,2,3,4,5,...| 0.11666666666666714|  -0.979574627738069|
|(34,[0,1,2,3,4,5,...|0.022222222222221748| -0.9508451890599838|
|(34,[0,1,2,3,4,5,...|  1.0888888888888895|-0.16543191543608482|
|(34,[0,1,2,3,4,5,...|   0.944444444444446|-0.05470584991819272|
|(34,[0,1,2,3,4,5,...|  0.8999999999999986|-0.27921089439742164|
|(34,[0,1,2,3,4,5,...|   0.344444444444443| -0.7300307701731183|
|(34,[0,1,2,3,4,5,...| 0.11666666666666714| -0.7186545323605458|
|(34,[0,1,2,3,4,5,...|  1

In [0]:
# Check how well the model performed on the test set: mean absolute error and mean squared error
pred_results.meanAbsoluteError,pred_results.meanSquaredError

Out[206]: (0.7341576094069727, 0.8807168916030339)
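
For reference, the two metrics can be written out directly. The sketch below defines them in plain Python on hypothetical values, then takes the square root of the mean squared error reported above to get an error in the same unit as the label (degrees Celsius).

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all rows."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of (actual - predicted)^2 over all rows."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical labels and predictions
actual = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]
mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)

# Root mean squared error derived from the MSE reported in Out[206]
reported_mse = 0.8807168916030339
rmse = math.sqrt(reported_mse)
print(rmse)
```

The MSE penalizes large errors more heavily than the MAE, which is why the two numbers are worth reading together.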