### Step 1: Set the Storage Account Name & Access Key

To read data from Azure Blob Storage, we first need to provide the storage account name and access key.

In [2]:
storage_account_name = "taxistorage2019"
storage_account_access_key = "########"

#### Set Up the Storage Account Access Key

In [4]:
spark.conf.set(
  "fs.azure.account.key."+storage_account_name+".blob.core.windows.net",
  storage_account_access_key)
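
The configuration key above follows a fixed pattern, so it can be built by a small helper. The following is a minimal sketch; the function name `account_key_conf` is hypothetical and is not part of any Spark or Azure API.

```python
def account_key_conf(account_name: str) -> str:
    """Return the Spark config key that holds the access key for a Blob Storage account."""
    return f"fs.azure.account.key.{account_name}.blob.core.windows.net"

conf_key = account_key_conf("taxistorage2019")
print(conf_key)  # fs.azure.account.key.taxistorage2019.blob.core.windows.net

# In a notebook with an active `spark` session, the key would be used as:
# spark.conf.set(conf_key, storage_account_access_key)
```

Building the key in one place avoids typos when several storage accounts are configured in the same notebook.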

### Step 2: Read the data

Now that the storage credentials are configured, we can create a DataFrame. Notice that we use an *option* to tell Spark to infer the schema from the file. If we already have a schema, we can set it explicitly instead.

First, let's create a DataFrame in Python.

In [6]:
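# Read the CSV file from Blob Storage, treating the first row as the header and inferring column types
dfTaxi = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("wasbs://taxicab@" + storage_account_name + ".blob.core.windows.net/train.csv")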
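
As a sketch of setting the schema explicitly rather than inferring it: `DataFrameReader.schema` accepts a DDL-formatted string as well as a `StructType`. The column names below match the sample rows shown later in this notebook; the column types, the `ASSUMED_COLUMNS` list, and the `to_ddl` helper are assumptions made for illustration and should be checked against `dfTaxi.printSchema()`.

```python
# Column names come from the dataset; the types are assumptions.
ASSUMED_COLUMNS = [
    ("TRIP_ID", "BIGINT"),
    ("CALL_TYPE", "STRING"),
    ("ORIGIN_CALL", "DOUBLE"),
    ("ORIGIN_STAND", "DOUBLE"),
    ("TAXI_ID", "INT"),
    ("TIMESTAMP", "BIGINT"),
    ("DAY_TYPE", "STRING"),
    ("MISSING_DATA", "BOOLEAN"),
    ("POLYLINE", "STRING"),
]

def to_ddl(columns):
    """Build a DDL schema string, backtick-quoting names so words like TIMESTAMP are safe."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)

schema_ddl = to_ddl(ASSUMED_COLUMNS)

# With an active `spark` session, the string would replace the inferSchema option
# (`path` stands for the wasbs:// URL of the file):
# dfTaxi = spark.read.format("csv").option("header", "true").schema(schema_ddl).load(path)
```

Supplying a schema spares Spark the extra pass over the file that schema inference requires.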

### Step 3: Save the Data

We save four copies of the data, one each in CSV, Avro, Parquet, and ORC format, so that we can compare how quickly queries run against each.

In [8]:
dfTaxi.write.mode("overwrite").option("header", "true").format("csv").saveAsTable("TaxiCSV")

In [9]:
dfTaxi.write.mode("overwrite").option("header", "true").format("avro").saveAsTable("TaxiAvro")

In [10]:
dfTaxi.write.mode("overwrite").option("header", "true").format("parquet").saveAsTable("TaxiParquet")

In [11]:
dfTaxi.write.mode("overwrite").option("header", "true").format("orc").saveAsTable("TaxiORC")
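
The four write cells differ only in format and table name, so they could be collapsed into one loop. The following is a sketch under that assumption; `TABLES` and `save_in_all_formats` are hypothetical names, and `df` is expected to be a Spark DataFrame such as `dfTaxi`.

```python
# Map each Spark data source format to the table name used in this notebook.
TABLES = {"csv": "TaxiCSV", "avro": "TaxiAvro", "parquet": "TaxiParquet", "orc": "TaxiORC"}

def save_in_all_formats(df, tables=TABLES):
    """Write df once per format, overwriting any existing table of the same name."""
    for fmt, table in tables.items():
        writer = df.write.mode("overwrite").format(fmt)
        if fmt == "csv":
            # Only the text-based CSV format needs a header row; the other
            # formats store column names in their own file metadata.
            writer = writer.option("header", "true")
        writer.saveAsTable(table)

# Usage in the notebook: save_in_all_formats(dfTaxi)
```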

We can query these tables using Spark SQL. For instance, we can perform a simple aggregation. Notice how the `%sql` magic command lets us query a table directly in SQL.

In [13]:
%sql
--CSV
SELECT COUNT(*) FROM TaxiCSV

count(1)
1710670


In [14]:
%sql
--Avro
SELECT COUNT(*) FROM TaxiAvro

count(1)
1710670


In [15]:
%sql
--Parquet
SELECT COUNT(*) FROM TaxiParquet

count(1)
1710670


In [16]:
%sql
--ORC
SELECT COUNT(*) FROM TaxiORC

count(1)
1710670
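
Since the goal is to compare query speed across formats, the same query can be timed against each table. The helper below is a minimal sketch; `time_call` is a hypothetical name, and the commented usage assumes an active `spark` session.

```python
import time

def time_call(run, repeats=3):
    """Call run() several times and return the fastest wall-clock duration in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best

# With an active `spark` session, each table could be timed like this:
# for table in ["TaxiCSV", "TaxiAvro", "TaxiParquet", "TaxiORC"]:
#     seconds = time_call(lambda: spark.sql(f"SELECT COUNT(*) FROM {table}").collect())
#     print(table, round(seconds, 3))
```

Repeated runs may benefit from caching, so the numbers are best read as a rough guide, not a benchmark.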


### More complex query examples

#### Example 1

In [19]:
%sql
SELECT SUM(Case when DAY_TYPE = 'B' then 1 else 0 end) as Trips_On_Holiday
      , SUM(Case when DAY_TYPE = 'C' then 1 else 0 end) as Trips_Before_Holiday
      , SUM(Case when DAY_TYPE = 'A' then 1 else 0 end) as Trips_On_NormalDays
FROM TaxiCSV

Trips_On_Holiday,Trips_Before_Holiday,Trips_On_NormalDays
0,0,1710670


In [20]:
%sql
SELECT SUM(Case when DAY_TYPE = 'B' then 1 else 0 end) as Trips_On_Holiday
      , SUM(Case when DAY_TYPE = 'C' then 1 else 0 end) as Trips_Before_Holiday
      , SUM(Case when DAY_TYPE = 'A' then 1 else 0 end) as Trips_On_NormalDays
FROM TaxiAvro

Trips_On_Holiday,Trips_Before_Holiday,Trips_On_NormalDays
0,0,1710670


In [21]:
%sql
SELECT SUM(Case when DAY_TYPE = 'B' then 1 else 0 end) as Trips_On_Holiday
      , SUM(Case when DAY_TYPE = 'C' then 1 else 0 end) as Trips_Before_Holiday
      , SUM(Case when DAY_TYPE = 'A' then 1 else 0 end) as Trips_On_NormalDays
FROM taxiparquet

Trips_On_Holiday,Trips_Before_Holiday,Trips_On_NormalDays
0,0,1710670


In [22]:
%sql
SELECT SUM(Case when DAY_TYPE = 'B' then 1 else 0 end) as Trips_On_Holiday
      , SUM(Case when DAY_TYPE = 'C' then 1 else 0 end) as Trips_Before_Holiday
      , SUM(Case when DAY_TYPE = 'A' then 1 else 0 end) as Trips_On_NormalDays
FROM taxiorc

Trips_On_Holiday,Trips_Before_Holiday,Trips_On_NormalDays
0,0,1710670
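
The four queries in Example 1 are identical apart from the table name, so they can be generated from one template. This is a sketch only; `DAY_TYPE_QUERY`, `TABLES`, and `queries` are hypothetical names.

```python
# One query template; only the table name changes between formats.
DAY_TYPE_QUERY = """
SELECT SUM(CASE WHEN DAY_TYPE = 'B' THEN 1 ELSE 0 END) AS Trips_On_Holiday
     , SUM(CASE WHEN DAY_TYPE = 'C' THEN 1 ELSE 0 END) AS Trips_Before_Holiday
     , SUM(CASE WHEN DAY_TYPE = 'A' THEN 1 ELSE 0 END) AS Trips_On_NormalDays
FROM {table}
"""

TABLES = ["TaxiCSV", "TaxiAvro", "TaxiParquet", "TaxiORC"]
queries = {table: DAY_TYPE_QUERY.format(table=table) for table in TABLES}

# With an active `spark` session: spark.sql(queries["TaxiORC"]).show()
```

Spark SQL resolves table names case-insensitively by default, so `TaxiORC` and `taxiorc` refer to the same table.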


#### Example 2

In [24]:
%sql
SELECT SUM(Case when MISSING_DATA = false then 1 else 0 end) as No_Missing_Data
      , SUM(Case when MISSING_DATA = true then 1 else 0 end) as Missing_Data
FROM taxicsv

No_Missing_Data,Missing_Data
1710660,10


In [25]:
%sql
SELECT SUM(Case when MISSING_DATA = false then 1 else 0 end) as No_Missing_Data
      , SUM(Case when MISSING_DATA = true then 1 else 0 end) as Missing_Data
FROM taxiparquet

No_Missing_Data,Missing_Data
1710660,10


In [26]:
%sql
SELECT SUM(Case when MISSING_DATA = false then 1 else 0 end) as No_Missing_Data
      , SUM(Case when MISSING_DATA = true then 1 else 0 end) as Missing_Data
FROM taxiorc

No_Missing_Data,Missing_Data
1710660,10


In [27]:
%sql
SELECT * FROM taxiorc
LIMIT 5

TRIP_ID,CALL_TYPE,ORIGIN_CALL,ORIGIN_STAND,TAXI_ID,TIMESTAMP,DAY_TYPE,MISSING_DATA,POLYLINE
1374839106620000649,A,31763.0,,20000649,1374839106,A,False,"[[-8.581896,41.180301],[-8.582706,41.180895],[-8.582733,41.180931],[-8.582157,41.181687],[-8.580843,41.18328],[-8.579268,41.184666],[-8.579736,41.185467],[-8.580177,41.185701],[-8.581032,41.186079],[-8.581779,41.185062],[-8.582472,41.18355],[-8.582553,41.182866],[-8.582877,41.1822],[-8.582895,41.182164],[-8.582913,41.182074],[-8.582922,41.182038],[-8.583003,41.181489],[-8.583498,41.180472],[-8.58429,41.178384],[-8.584974,41.177223],[-8.585217,41.176746],[-8.585622,41.175855],[-8.586306,41.174838],[-8.587359,41.172903],[-8.588493,41.171004],[-8.589762,41.169366],[-8.592219,41.167971],[-8.594262,41.167044],[-8.596764,41.16627],[-8.598879,41.165505],[-8.600085,41.164857],[-8.60049,41.164542],[-8.601615,41.163624],[-8.602299,41.16321],[-8.603532,41.162202],[-8.603784,41.16204],[-8.603784,41.162049],[-8.603793,41.162031],[-8.604306,41.161806],[-8.604972,41.161896],[-8.607042,41.162202],[-8.609058,41.162481],[-8.609535,41.161689],[-8.60958,41.161698],[-8.609499,41.161617],[-8.60976,41.160753],[-8.609877,41.160636],[-8.610021,41.160186],[-8.610093,41.158323],[-8.609994,41.156388],[-8.610012,41.155695],[-8.610021,41.154876],[-8.610075,41.154084],[-8.609841,41.153103],[-8.609841,41.151924],[-8.609895,41.151843],[-8.609904,41.151753],[-8.60994,41.15178],[-8.61093,41.150691],[-8.611191,41.14935],[-8.611083,41.147937],[-8.611668,41.147865],[-8.613189,41.148369],[-8.614539,41.14827]]"
1374837051620000682,B,,33.0,20000682,1374837051,A,False,"[[-8.600031,41.182677],[-8.600184,41.182722],[-8.599968,41.18274],[-8.5986,41.182371],[-8.597394,41.181804],[-8.598141,41.180184],[-8.598888,41.178654],[-8.599797,41.176863],[-8.599977,41.175036],[-8.59932,41.173929],[-8.598771,41.173056],[-8.598348,41.172336],[-8.597871,41.171544],[-8.597322,41.170149],[-8.597295,41.169555],[-8.597637,41.168124],[-8.598204,41.167062],[-8.597412,41.166882],[-8.597574,41.166198],[-8.598132,41.165865],[-8.598168,41.165865],[-8.598186,41.165892],[-8.598186,41.165883],[-8.598222,41.165883],[-8.598798,41.165037],[-8.599248,41.163678],[-8.599932,41.162184],[-8.600175,41.161716],[-8.600508,41.160924],[-8.600778,41.159979],[-8.600076,41.159187],[-8.599032,41.158863],[-8.599068,41.158827],[-8.599059,41.158809],[-8.598708,41.158332],[-8.599068,41.156586],[-8.599608,41.154786],[-8.599788,41.153058],[-8.599941,41.152032],[-8.600076,41.151213],[-8.600139,41.150871],[-8.600346,41.149503],[-8.599554,41.149404],[-8.5986,41.148603],[-8.598573,41.14854],[-8.598573,41.148549],[-8.598555,41.148558],[-8.598411,41.148405],[-8.596944,41.149206],[-8.595198,41.150322],[-8.594919,41.150502]]"
1374837471620000251,C,,,20000251,1374837471,A,False,"[[-8.615232,41.14107],[-8.61489,41.140827],[-8.61408,41.141106],[-8.613621,41.141403],[-8.610075,41.140962],[-8.609625,41.140737],[-8.609463,41.140197],[-8.60922,41.139441],[-8.609391,41.139081],[-8.610165,41.138721],[-8.610453,41.138568],[-8.611191,41.138199],[-8.61237,41.137929],[-8.613468,41.137668],[-8.613711,41.137623],[-8.614305,41.137479],[-8.614737,41.137371],[-8.614809,41.137353],[-8.615097,41.137281],[-8.615745,41.137137],[-8.616447,41.136984],[-8.617086,41.136921],[-8.618004,41.136867],[-8.619417,41.137092],[-8.620092,41.137623],[-8.620974,41.138478],[-8.621496,41.139351],[-8.622207,41.139918],[-8.622828,41.140665],[-8.62353,41.141322],[-8.624331,41.141745],[-8.625573,41.142123],[-8.627121,41.142429],[-8.628507,41.142753],[-8.629938,41.142843],[-8.631054,41.143059],[-8.632332,41.143374],[-8.633844,41.143869],[-8.635329,41.144535],[-8.636436,41.14503],[-8.637678,41.145345],[-8.638578,41.145777],[-8.640261,41.145399],[-8.642043,41.144625],[-8.641692,41.14287],[-8.641593,41.141952],[-8.641638,41.143059],[-8.641791,41.14485],[-8.639775,41.145327],[-8.637606,41.144337],[-8.635941,41.142852],[-8.635023,41.141331],[-8.635437,41.140638]]"
1374836122620000006,B,,14.0,20000006,1374836122,A,False,"[[-8.611047,41.149431],[-8.611083,41.149431],[-8.611101,41.149296],[-8.610804,41.149197],[-8.610777,41.149161],[-8.610363,41.149215],[-8.610066,41.149764],[-8.61003,41.150574],[-8.609661,41.151681],[-8.609931,41.153103],[-8.609328,41.153454],[-8.609283,41.153445],[-8.609265,41.153445],[-8.609265,41.153454],[-8.609283,41.153463],[-8.609238,41.153517],[-8.609445,41.153688],[-8.609517,41.153742],[-8.609625,41.153724],[-8.609679,41.153715],[-8.609724,41.153688],[-8.60994,41.153742],[-8.61156,41.153688],[-8.611938,41.15448],[-8.613063,41.155173],[-8.614539,41.155461],[-8.614944,41.155533],[-8.616771,41.155857],[-8.617536,41.155974],[-8.617536,41.155983],[-8.617995,41.156037],[-8.618463,41.156127],[-8.618931,41.156217],[-8.620425,41.156433],[-8.621361,41.156208],[-8.621388,41.156064]]"
1374834945620000258,B,,61.0,20000258,1374834945,A,False,"[[-8.599248,41.149134],[-8.598762,41.148756],[-8.598627,41.148495],[-8.598627,41.148477],[-8.598636,41.148477],[-8.602335,41.148963],[-8.604864,41.149368],[-8.606034,41.149719],[-8.606529,41.149962],[-8.606601,41.149998],[-8.607375,41.150115],[-8.607942,41.150313],[-8.608617,41.150376],[-8.609481,41.150628],[-8.609454,41.150718],[-8.609436,41.150709]]"
