# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("My Application").getOrCreate()



In [2]:
# Import only the functions used below; a wildcard import would shadow built-ins such as sum().
from pyspark.sql.functions import col, desc, regexp_replace

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [4]:
# Read the tab-separated file. No schema is inferred, so every column is loaded as a string.
chipo = (
    spark.read.format("csv")
    .option("header", "true")
    .option("sep", "\t")
    .load("04_chipole.tsv")
)

chipo.show()


+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
|       5|       1| Chips and Guacamole|           
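
Step 2 asks for the dataset at the address above, while the cell reads a local copy named `04_chipole.tsv`. A minimal sketch of one way to bridge the two, assuming the machine running the notebook has network access: download the file once with the standard library, then apply the same reader options to it. The helper names `download_if_missing` and `load_chipo` and the destination name `chipotle.tsv` are hypothetical and not part of the original notebook.

```python
import os
import urllib.request

# URL taken from Step 2 of the notebook.
CHIPOTLE_URL = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"


def download_if_missing(url, dest):
    """Fetch url into dest unless the file already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


def load_chipo(spark, path):
    """Read a tab-separated file with a header row into a Spark DataFrame.

    `spark` is assumed to be the SparkSession created in Step 1.
    All columns come back as strings because no schema is inferred.
    """
    return (
        spark.read
        .option("header", "true")
        .option("sep", "\t")
        .csv(path)
    )
```

With the `spark` session from Step 1, `chipo = load_chipo(spark, download_if_missing(CHIPOTLE_URL, "chipotle.tsv"))` should produce the same DataFrame as the cell above, provided the local copy matches the published file.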

### Step 4. How many products cost more than $10.00?

In [6]:
chipo_clean = chipo.withColumn("item_price_clean", regexp_replace("item_price", "[$]", "").cast("float"))

chipo_clean.filter(col("item_price_clean") > 10.00).show()


+--------+--------+------------------+--------------------+----------+----------------+
|order_id|quantity|         item_name|  choice_description|item_price|item_price_clean|
+--------+--------+------------------+--------------------+----------+----------------+
|       2|       2|      Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |           16.98|
|       3|       1|      Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |           10.98|
|       4|       1|     Steak Burrito|[Tomatillo Red Ch...|   $11.75 |           11.75|
|       7|       1|      Chicken Bowl|[Fresh Tomato Sal...|   $11.25 |           11.25|
|      12|       1|   Chicken Burrito|[[Tomatillo-Green...|   $10.98 |           10.98|
|      19|       1|     Barbacoa Bowl|[Roasted Chili Co...|   $11.75 |           11.75|
|      20|       1|      Chicken Bowl|[Roasted Chili Co...|   $11.25 |           11.25|
|      20|       1|     Steak Burrito|[Fresh Tomato Sal...|   $11.75 |           11.75|
|      21|       1|   Chicken Bu
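
The cell above lists the matching order lines rather than returning a number, and the same product appears on many lines. A minimal sketch of two ways to turn it into a count, assuming `chipo_clean` from the cell above. The helper names `count_lines_over` and `count_products_over` are hypothetical, and reading "products" as distinct `item_name` values is only one interpretation of the question. Note also that `item_price` appears to be the price of the whole order line (the first Chicken Bowl row has quantity 2 and costs $16.98), so neither count is a strict per-unit comparison.

```python
def count_lines_over(df, threshold=10.0):
    """Number of order lines whose cleaned price exceeds threshold."""
    # DataFrame.filter accepts a SQL expression string.
    return df.filter(f"item_price_clean > {threshold}").count()


def count_products_over(df, threshold=10.0):
    """Number of distinct item names with at least one line above threshold."""
    return (
        df.filter(f"item_price_clean > {threshold}")
        .select("item_name")
        .distinct()
        .count()
    )
```

Calling `count_lines_over(chipo_clean)` and `count_products_over(chipo_clean)` gives the two figures side by side; which one answers Step 4 depends on how "products" is meant.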

### Step 5. What is the price of each item? 
###### Print a DataFrame with only two columns: item_name and item_price

In [7]:
chipo_clean.select("item_name", "item_price").show()

+--------------------+----------+
|           item_name|item_price|
+--------------------+----------+
|Chips and Fresh T...|    $2.39 |
|                Izze|    $3.39 |
|    Nantucket Nectar|    $3.39 |
|Chips and Tomatil...|    $2.39 |
|        Chicken Bowl|   $16.98 |
|        Chicken Bowl|   $10.98 |
|       Side of Chips|    $1.69 |
|       Steak Burrito|   $11.75 |
|    Steak Soft Tacos|    $9.25 |
|       Steak Burrito|    $9.25 |
| Chips and Guacamole|    $4.45 |
|Chicken Crispy Tacos|    $8.75 |
|  Chicken Soft Tacos|    $8.75 |
|        Chicken Bowl|   $11.25 |
| Chips and Guacamole|    $4.45 |
|Chips and Tomatil...|    $2.39 |
|     Chicken Burrito|    $8.49 |
|     Chicken Burrito|    $8.49 |
|         Canned Soda|    $2.18 |
|        Chicken Bowl|    $8.75 |
+--------------------+----------+
only showing top 20 rows



### Step 6. Sort by the name of the item

In [8]:
chipo_clean.orderBy("item_name").show()

+--------+--------+-----------------+------------------+----------+----------------+
|order_id|quantity|        item_name|choice_description|item_price|item_price_clean|
+--------+--------+-----------------+------------------+----------+----------------+
|     511|       1|6 Pack Soft Drink|            [Coke]|    $6.49 |            6.49|
|    1253|       1|6 Pack Soft Drink|        [Lemonade]|    $6.49 |            6.49|
|     520|       1|6 Pack Soft Drink|          [Sprite]|    $6.49 |            6.49|
|     148|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     566|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     168|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     708|       1|6 Pack Soft Drink|            [Coke]|    $6.49 |            6.49|
|     230|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49 |            6.49|
|     709|       1|6 Pack Soft Drink|       [Diet Coke]|    $6.49

### Step 7. What was the quantity of the most expensive item ordered?

In [11]:
# Sort on the cleaned price first, then keep only the columns to display.
chipo_clean.orderBy(desc("item_price_clean")).select("item_name", "quantity").show(1)

+--------------------+--------+
|           item_name|quantity|
+--------------------+--------+
|Chips and Fresh T...|      15|
+--------------------+--------+
only showing top 1 row
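
`show(1)` prints a truncated item name and returns nothing that later cells can reuse. A minimal sketch that returns the top row as a plain dictionary instead, assuming `chipo_clean` from Step 4. The helper name `most_expensive_line` is hypothetical.

```python
def most_expensive_line(df):
    """Return the order line with the highest cleaned price as a dict, or None if df is empty."""
    row = (
        df.orderBy("item_price_clean", ascending=False)
        .select("item_name", "quantity", "item_price_clean")
        .first()
    )
    return row.asDict() if row is not None else None
```

`most_expensive_line(chipo_clean)` should contain the full, untruncated `item_name` along with the `quantity` shown in the output above. Because the file was read without a schema, `quantity` comes back as a string.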



### Step 8. How many times was a Veggie Salad Bowl ordered?

In [14]:
from pyspark.sql.functions import col, sum as spark_sum

# Alias Spark's sum so it does not shadow Python's built-in sum().
chipo_clean.filter(col("item_name") == "Veggie Salad Bowl") \
           .agg(spark_sum("quantity").alias("total_quantity")) \
           .show()


+--------------+
|total_quantity|
+--------------+
|          18.0|
+--------------+
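
The total prints as `18.0` rather than `18`, most likely because the file was read without schema inference: `quantity` is then a string column, and Spark converts it to a double when summing. "How many times" can also mean either the number of order lines or the total quantity. A minimal sketch that casts `quantity` to an integer and returns both figures, assuming `chipo_clean` from Step 4. The helper name `item_order_totals` is hypothetical.

```python
def item_order_totals(df, item_name="Veggie Salad Bowl"):
    """Return (order_lines, total_quantity) for one item name."""
    matches = df.filter(df["item_name"] == item_name)
    row = matches.selectExpr(
        "count(*) AS order_lines",
        # Cast explicitly so the sum is an integer instead of a double.
        "sum(CAST(quantity AS INT)) AS total_quantity",
    ).first()
    return row["order_lines"], row["total_quantity"]
```

If no line matches, the first value is 0 and the second is `None`, since a SQL sum over zero rows is NULL.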



### Step 9. How many times did someone order more than one Canned Soda?

In [18]:
from pyspark.sql.functions import col

chipo_clean.filter(
    (col("item_name") == "Canned Soda") & (col("quantity") > 1)
).show()


+--------+--------+-----------+------------------+----------+----------------+
|order_id|quantity|  item_name|choice_description|item_price|item_price_clean|
+--------+--------+-----------+------------------+----------+----------------+
|       9|       2|Canned Soda|          [Sprite]|    $2.18 |            2.18|
|      23|       2|Canned Soda|    [Mountain Dew]|    $2.18 |            2.18|
|      73|       2|Canned Soda|       [Diet Coke]|    $2.18 |            2.18|
|      76|       2|Canned Soda| [Diet Dr. Pepper]|    $2.18 |            2.18|
|     150|       2|Canned Soda|       [Diet Coke]|    $2.18 |            2.18|
|     151|       2|Canned Soda|       [Coca Cola]|    $2.18 |            2.18|
|     287|       2|Canned Soda|       [Coca Cola]|    $2.18 |            2.18|
|     288|       2|Canned Soda|       [Coca Cola]|    $2.18 |            2.18|
|     376|       2|Canned Soda|    [Mountain Dew]|    $2.18 |            2.18|
|     450|       2|Canned Soda|      [Dr. Pepper]|  
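
The question asks for a number, while the cell above prints the matching rows, and `show()` displays at most 20 of them by default. A minimal sketch that returns the count directly, assuming `chipo_clean` from Step 4. The helper name `count_multi_orders` is hypothetical, and the explicit cast to `int` is a choice made here so the comparison does not rely on Spark's implicit string-to-number conversion.

```python
def count_multi_orders(df, item_name="Canned Soda", min_quantity=1):
    """Count order lines for item_name whose quantity is greater than min_quantity."""
    return (
        df.filter(df["item_name"] == item_name)
        # quantity was loaded as a string, so cast before comparing numerically.
        .filter(df["quantity"].cast("int") > min_quantity)
        .count()
    )
```

`count_multi_orders(chipo_clean)` then answers Step 9 with a single integer.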