# Retrieve taxi trip data for New York City

This notebook retrieves data on taxi trips, which is provided by New York City. The data is stored in Kafka, and then used by the other notebook in this project (Stream-data-from-Kafka-to-Cosmos-DB) to demonstrate how to write data into Cosmos DB.

The data set used by this notebook is from [2016 Green Taxi Trip Data](https://data.cityofnewyork.us/Transportation/2016-Green-Taxi-Trip-Data/hvrh-b6nb).

## To use this notebook

Jupyter Notebooks allow you to modify and run the code in this document. To run a section (known as a 'cell'), select it and then press CTRL + ENTER, or select the play button on the toolbar above. Each section already has some example output beneath it, so you can see what the results of running a cell will look like.

NOTE: You must run each cell in order, from top to bottom. Running cells out of order can result in an error.

## Requirements

* An Azure Virtual Network
* A Spark (2.4) on HDInsight 4.0 cluster, inside the virtual network
* A Kafka on HDInsight 4.0 cluster, inside the virtual network

## Load packages

Run the next cell to load packages used by this notebook:

* spark-streaming-kafka-0-8_2.10, version 2.2.0 - Used to write data to Kafka.
* gson version 2.4 - Used for JSON parsing.

__NOTE__: The first time you run this block, it may take a minute or longer. This happens because the Spark cluster must retrieve the packages from the Maven repository on the internet.

In [None]:
%%configure -f
{
    "conf": {
        "spark.jars.packages": "org.apache.spark:spark-streaming-kafka-0-8_2.10:2.2.0,com.google.code.gson:gson:2.4",
        "spark.jars.excludes": "org.scala-lang:scala-reflect,org.apache.spark:spark-tags_2.11"
    }
}

## Create the Kafka topic

In the next cell, you must provide the ZooKeeper host information for your Kafka cluster. Use one of the following methods to get this information:

* From __Bash__ or other Unix shell:

    ```bash
    CLUSTERNAME='the name of your HDInsight cluster'
    PASSWORD='the password for your cluster login account'
    curl -u "admin:$PASSWORD" -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/ZOOKEEPER/components/ZOOKEEPER_SERVER" | grep -i host_name | awk 'NR==1{print $NF,":2181"}' | tr -d '"' | tr -d ' '
    ```

* From __Azure PowerShell__:

    ```powershell
    $creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
    $clusterName = Read-Host -Prompt "Enter the Kafka cluster name"
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/services/ZOOKEEPER/components/ZOOKEEPER_SERVER" `
        -Credential $creds `
        -UseBasicParsing
    $respObj = ConvertFrom-Json $resp.Content
    $zkHosts = $respObj.host_components.HostRoles.host_name[0..1]
    ($zkHosts -join ":2181,") + ":2181"
    ```

The return value is similar to the following example:

`zk0-kafka.ztgnbfvxu2mudoa5h5zzc1uncg.cx.internal.cloudapp.net:2181,zk1-kafka.ztgnbfvxu2mudoa5h5zzc1uncg.cx.internal.cloudapp.net:2181`
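
Before pasting a value like this into a cell, it can help to check that every entry has the `host:port` shape. The following is a minimal sketch, not part of the original notebook: `parseHostList` is a hypothetical helper, and the host names in `sample` are made-up placeholders.

```scala
// Hypothetical helper (not part of the notebook): split a comma-separated
// "host:port" list and check that every entry ends in a numeric port.
def parseHostList(hosts: String): Seq[(String, Int)] =
  hosts.split(",").toSeq.map(_.trim).filter(_.nonEmpty).map { entry =>
    val idx = entry.lastIndexOf(':')
    require(idx > 0, "Missing port in '" + entry + "'")
    val port = entry.substring(idx + 1)
    require(port.nonEmpty && port.forall(_.isDigit), "Invalid port in '" + entry + "'")
    (entry.substring(0, idx), port.toInt)
  }

// Placeholder host names, used only for illustration
val sample = "zk0-kafka.example.internal:2181,zk1-kafka.example.internal:2181"
val parsed = parseHostList(sample)
parsed.foreach { case (host, port) => println(host + " -> " + port) }
```

A malformed entry (for example, one without a port) causes `require` to throw an `IllegalArgumentException`, which surfaces the problem before the value reaches Kafka tooling.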

Replace `YOUR_ZOOKEEPER_HOSTS` in the next cell with the returned value, and then run the cell.

In [None]:
%%bash 
/usr/hdp/current/kafka-broker/bin/kafka-topics.sh --create --replication-factor 3 --partitions 8 --topic tripdata --zookeeper 'YOUR_ZOOKEEPER_HOSTS'

## Retrieve data on taxi trips

Run the next cell to load data on taxi trips in New York City.

In [None]:
// Load the data from the New York City Taxi data REST API for 2016 Green Taxi Trip Data
val url="https://data.cityofnewyork.us/resource/pqfs-mqru.json"
val result = scala.io.Source.fromURL(url).mkString

// Since the REST API returns an array of items,
// it's easier to use as an array than deal with streaming
import com.google.gson.Gson
val gson = new Gson()
val jsonDataArray = gson.fromJson(result, classOf[Array[Object]])

println("Retrieved " + jsonDataArray.length + " rows of Taxi data.")
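
The number of rows retrieved determines how long the later send loop runs. As a hedged sketch, assuming the endpoint honors the SODA `$limit` query parameter, a small helper could cap the row count. `withLimit` is a hypothetical name and is not used elsewhere in this notebook.

```scala
// Hypothetical helper: append a SODA-style $limit parameter to a resource URL.
// Assumption: the endpoint accepts $limit as a query parameter.
def withLimit(baseUrl: String, limit: Int): String = {
  require(limit > 0, "limit must be positive")
  // Use '&' if the URL already has a query string, otherwise start one with '?'
  val separator = if (baseUrl.contains("?")) "&" else "?"
  baseUrl + separator + "$limit=" + limit
}

val limitedUrl = withLimit("https://data.cityofnewyork.us/resource/pqfs-mqru.json", 100)
println(limitedUrl)
// prints "https://data.cityofnewyork.us/resource/pqfs-mqru.json?$limit=100"
```

The resulting string could be passed to `scala.io.Source.fromURL` in place of `url` to work with a smaller sample.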

## Set the Kafka broker hosts information

In the next cell, replace `YOUR_BROKER_HOSTS` with the broker hosts for your Kafka cluster. This value is used to write data to the Kafka cluster. To get the broker host information, use one of the following methods:

* From __Bash__ or other Unix shell:

    ```bash
    CLUSTERNAME='the name of your HDInsight cluster'
    PASSWORD='the password for your cluster login account'
    curl -u "admin:$PASSWORD" -G "https://$CLUSTERNAME.azurehdinsight.net/api/v1/clusters/$CLUSTERNAME/services/KAFKA/components/KAFKA_BROKER" | grep -i host_name | awk 'NR==1{print $NF,":9092"}' | tr -d '"' | tr -d ' '
    ```

* From __Azure PowerShell__:

    ```powershell
    $creds = Get-Credential -UserName "admin" -Message "Enter the HDInsight login"
    $clusterName = Read-Host -Prompt "Enter the Kafka cluster name"
    $resp = Invoke-WebRequest -Uri "https://$clusterName.azurehdinsight.net/api/v1/clusters/$clusterName/services/KAFKA/components/KAFKA_BROKER" `
        -Credential $creds `
        -UseBasicParsing
    $respObj = ConvertFrom-Json $resp.Content
    $brokerHosts = $respObj.host_components.HostRoles.host_name[0..1]
    ($brokerHosts -join ":9092,") + ":9092"
    ```

In [None]:
// The Kafka broker hosts and topic used to write to Kafka
val kafkaBrokers="YOUR_BROKER_HOSTS"
val kafkaTopic="tripdata"

## Send the data to Kafka

Run the following cell to begin streaming data to Kafka. The cell pauses for 1 second (1000 ms) after each send, so it stays active for several minutes. This gives you time to load the other notebook and run its cells while data is flowing into Kafka.

In [None]:
// Import classes used to write to Kafka via a producer
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import java.util.HashMap

// Create the Kafka producer
val producerProperties = new HashMap[String, Object]()
producerProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaBrokers)
producerProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                           "org.apache.kafka.common.serialization.StringSerializer")
producerProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                           "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](producerProperties)

// Iterate over data and emit to Kafka
jsonDataArray.foreach { row =>
  // Get the row as a JSON string
  val jsonData = gson.toJson(row)
  // Create the message for Kafka (no key, so the key argument is null)
  val message = new ProducerRecord[String, String](kafkaTopic, null, jsonData)
  // Send the message
  producer.send(message)
  // Sleep a bit between sends to simulate streaming data
  Thread.sleep(1000)
}
// Close the producer, which waits for pending sends to complete
producer.close()
println("Finished writing to Kafka")
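
To get a rough idea of how long the loop above stays active, multiply the row count by the per-send delay. The following is a minimal sketch: `estimatedMinutes` is a hypothetical helper, and the row count of 1000 is an assumed example value, not a figure from this notebook.

```scala
// Hypothetical helper: estimate how long a throttled send loop runs, in minutes.
// It counts only the sleep time and ignores the time spent in each send call.
def estimatedMinutes(rowCount: Int, delayMs: Long): Double =
  rowCount * delayMs / 60000.0

// Assumed example: 1000 rows with the 1000 ms delay used in the cell above
val minutes = estimatedMinutes(1000, 1000L)
println("Approximate run time in minutes: " + minutes)
```

In the notebook itself, `jsonDataArray.length` could be passed as `rowCount` to use the actual number of rows retrieved.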

## Load and run the Stream-data-from-Kafka-to-Cosmos-DB notebook

While the previous cell is active, load the other notebook in this project (Stream-data-from-Kafka-to-Cosmos-DB) and follow the steps in it.