# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [14]:
from pyspark.sql import SparkSession
import requests
from pyspark.sql.functions import asc, desc, regexp_replace
from pyspark.sql.types import DoubleType

In [2]:
spark = SparkSession.builder.master("local[2]").appName("chipotle").getOrCreate()


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

In [3]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

In [6]:
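r = requests.get(url)
r.raise_for_status()  # fail early if the download did not succeed

# save a local copy so Spark can read it as a regular file
with open("chipotle.tsv", "w", encoding="utf-8") as f:
    f.write(r.text)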

### Step 3. Assign it to a variable called chipo.

In [4]:
chipo = spark.read.options(header=True, inferSchema=True, delimiter="\t").csv("chipotle.tsv")

### Step 4. How many products cost more than $10.00?

In [18]:
# strip the "$" from item_price and cast the result to a double in a new column, price
df_chipo = chipo.withColumn('price', regexp_replace('item_price','[$]', '').cast(DoubleType()))
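
To see what the substitution does to a single value, the same pattern can be tried in plain Python. This is only an illustrative sketch: `parse_price` is a hypothetical helper that is not part of the notebook, and it strips the trailing space explicitly instead of relying on Spark's cast.

```python
import re

def parse_price(raw: str) -> float:
    """Remove the dollar sign and surrounding whitespace, then convert to float.

    Hypothetical helper for illustration; the notebook does this with
    regexp_replace(...).cast(DoubleType()) inside Spark.
    """
    return float(re.sub(r"[$]", "", raw).strip())

# Values copied from the item_price column, including the trailing space.
print(parse_price("$2.39 "))
print(parse_price("$16.98 "))
```

The character class `[$]` matches a literal dollar sign, which avoids having to escape `$` (normally an end-of-line anchor in a regex).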

In [19]:
df_chipo.show(10)

+--------+--------+--------------------+--------------------+----------+-----+
|order_id|quantity|           item_name|  choice_description|item_price|price|
+--------+--------+--------------------+--------------------+----------+-----+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 | 2.39|
|       1|       1|                Izze|        [Clementine]|    $3.39 | 3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 | 3.39|
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 | 2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |16.98|
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |10.98|
|       3|       1|       Side of Chips|                NULL|    $1.69 | 1.69|
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |11.75|
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 | 9.25|
|       5|       1|       Steak Burrito|[Fresh Tomat

In [21]:
df_chipo.filter('price > 10.00').count()

1130

### Step 5. What is the price of each item? 
###### Print a DataFrame with only two columns: item_name and item_price

In [34]:
chipo.select('item_name', 'item_price').distinct().show(1000)

+--------------------+----------+
|           item_name|item_price|
+--------------------+----------+
|    Steak Soft Tacos|    $9.25 |
|    Barbacoa Burrito|    $9.25 |
|Chips and Mild Fr...|    $3.00 |
|       Carnitas Bowl|   $23.50 |
|     Chicken Burrito|   $10.58 |
|        Chicken Bowl|    $8.49 |
|  Chicken Salad Bowl|   $17.50 |
|       Bottled Water|    $3.00 |
|        Chicken Bowl|    $8.75 |
|        Chicken Bowl|   $21.96 |
|    Nantucket Nectar|    $6.78 |
|Chicken Crispy Tacos|   $10.98 |
|Chips and Tomatil...|    $2.39 |
|Chips and Fresh T...|   $44.25 |
|       Bottled Water|    $4.50 |
|      Veggie Burrito|   $33.75 |
|     Chicken Burrito|   $16.38 |
|         Veggie Bowl|    $8.75 |
|Chicken Crispy Tacos|   $11.25 |
|Chips and Tomatil...|    $5.90 |
|  Chicken Soft Tacos|    $8.75 |
|        Chicken Bowl|   $32.94 |
|   Canned Soft Drink|    $2.50 |
|          Steak Bowl|   $23.50 |
|       Chicken Salad|    $8.49 |
|    Barbacoa Burrito|   $11.48 |
|     Chicken 

### Step 6. Sort by the name of the item

In [19]:
chipo.sort(asc("item_name")).head(10)

[Row(order_id=264, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Diet Coke]', item_price='$6.49 '),
 Row(order_id=520, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Sprite]', item_price='$6.49 '),
 Row(order_id=298, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Nestea]', item_price='$6.49 '),
 Row(order_id=148, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Diet Coke]', item_price='$6.49 '),
 Row(order_id=306, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Coke]', item_price='$6.49 '),
 Row(order_id=168, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Diet Coke]', item_price='$6.49 '),
 Row(order_id=363, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Coke]', item_price='$6.49 '),
 Row(order_id=230, quantity=1, item_name='6 Pack Soft Drink', choice_description='[Diet Coke]', item_price='$6.49 '),
 Row(order_id=422, quantity=1, item_name='6 Pack Soft Drink', choice_des

### Step 7. What was the quantity of the most expensive item ordered?

In [49]:
df_chipo.sort(desc('price')).select('quantity').show(1)

+--------+
|quantity|
+--------+
|      15|
+--------+
only showing top 1 row



### Step 8. How many times was a Veggie Salad Bowl ordered?

In [21]:
chipo.filter("item_name = 'Veggie Salad Bowl'").count()

18

### Step 9. How many times did someone order more than one Canned Soda?

In [27]:
chipo.filter("item_name = 'Canned Soda' AND quantity > 1").groupBy("item_name").count().show()

+-----------+-----+
|  item_name|count|
+-----------+-----+
|Canned Soda|   20|
+-----------+-----+

