# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=f2c0d2f86c097b500b134628aa45b0d21a5ff020cc0a0479bcf198dbe25b52c0
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [2]:
from pyspark.sql import SparkSession, functions as f
from pyspark.files import SparkFiles 

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

In [48]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

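# Create (or reuse) a Spark session for this exercise.
spark = SparkSession.builder.appName("exercise21").getOrCreate()

# Download the file and register it with Spark so SparkFiles can locate it.
spark.sparkContext.addFile(url)

# Read the tab-separated file, using the first row as the header.
df = spark.read.csv("file://" + SparkFiles.get("chipotle.tsv"), sep="\t", header=True, inferSchema=True)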
df.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



In [50]:
from pyspark.sql.types import DecimalType

# Strip the leading "$" so the price can be parsed as a number.
df = df.withColumn("item_price", f.regexp_replace("item_price", r"^\$+", ""))
# Cast the cleaned string to a fixed-point decimal with two fractional digits.
df = df.withColumn("item_price", f.col("item_price").cast(DecimalType(10, 2)))
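
The pattern `^\$+` only removes dollar signs at the start of the string. As a rough check outside Spark, the same cleaning can be sketched in plain Python with `re` and `decimal`. The sample strings and the helper name `parse_price` are assumptions for illustration, not values or code from the notebook.

```python
import re
from decimal import Decimal

def parse_price(raw: str) -> Decimal:
    """Drop leading dollar signs and surrounding whitespace, then parse as a decimal."""
    cleaned = re.sub(r"^\$+", "", raw.strip())
    # Round to two fractional digits, mirroring the scale of DecimalType(10, 2).
    return Decimal(cleaned).quantize(Decimal("0.01"))

# Hypothetical inputs in the same "$<amount>" shape as the item_price column.
samples = ["$2.39 ", "$16.98", "$44.25 "]
prices = [parse_price(s) for s in samples]
print(prices)
```

The raw-string prefix (`r"..."`) keeps Python from treating `\$` as an escape sequence, so the backslash reaches the regex engine unchanged.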

### Step 3. Assign it to a variable called chipo.

In [None]:
chipo = df

### Step 4. How many products cost more than $10.00?

In [52]:
df.select("item_name").filter(f.col("item_price") > 10).count()

1130
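
The figure above counts order rows priced above $10, so an item that appears in many orders is counted once per row. If the question is read as asking for distinct products, calling `.distinct()` on the selected `item_name` column before `.count()` would give that number instead. The difference can be sketched in plain Python on a few hypothetical rows (the rows below are made up for illustration, not taken from the dataset):

```python
from decimal import Decimal

# Hypothetical (item_name, item_price) rows, not taken from the dataset.
rows = [
    ("Chicken Bowl", Decimal("10.98")),
    ("Chicken Bowl", Decimal("11.25")),
    ("Steak Burrito", Decimal("11.75")),
    ("Side of Chips", Decimal("1.69")),
]

# Keep the names of rows priced above 10.
expensive = [name for name, price in rows if price > 10]

row_count = len(expensive)           # analogous to filter(...).count()
product_count = len(set(expensive))  # analogous to adding distinct() before count()
print(row_count, product_count)
```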

### Step 5. What is the price of each item? 
###### Print a DataFrame with only two columns: item_name and item_price

In [53]:
# Select the item name together with its price.
df.select("item_name", "item_price").show()

### Step 6. Sort by the name of the item

In [57]:
df.select("item_name").distinct().sort("item_name").show()

+--------------------+
|           item_name|
+--------------------+
|   6 Pack Soft Drink|
|       Barbacoa Bowl|
|    Barbacoa Burrito|
|Barbacoa Crispy T...|
| Barbacoa Salad Bowl|
| Barbacoa Soft Tacos|
|       Bottled Water|
|                Bowl|
|             Burrito|
|         Canned Soda|
|   Canned Soft Drink|
|       Carnitas Bowl|
|    Carnitas Burrito|
|Carnitas Crispy T...|
|      Carnitas Salad|
| Carnitas Salad Bowl|
| Carnitas Soft Tacos|
|        Chicken Bowl|
|     Chicken Burrito|
|Chicken Crispy Tacos|
+--------------------+
only showing top 20 rows



### Step 7. What was the quantity of the most expensive item ordered?

In [63]:
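# Find the highest price, then show the quantity on the matching order row.
max_price = df.select(f.max("item_price")).collect()[0][0]
df.select("item_name", "quantity").filter(f.col("item_price") == max_price).show()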

+--------------------+--------+
|           item_name|quantity|
+--------------------+--------+
|Chips and Fresh T...|      15|
+--------------------+--------+



### Step 8. How many times was a Veggie Salad Bowl ordered?


In [68]:
# Filter to the item first, then total its ordered quantity.
df.filter(f.col("item_name") == "Veggie Salad Bowl").groupBy("item_name").sum("quantity").show()

+-----------------+-------------+
|        item_name|sum(quantity)|
+-----------------+-------------+
|Veggie Salad Bowl|           18|
+-----------------+-------------+



### Step 9. How many times did someone order more than one Canned Soda?

In [89]:
# Keep Canned Soda rows with quantity above 1, then total the quantity.
df.filter(
    (f.col("quantity") > 1) & (f.col("item_name") == "Canned Soda")
).select(f.sum("quantity")).show()

+-------------+
|sum(quantity)|
+-------------+
|           42|
+-------------+
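
The query above totals the quantity across the matching rows, which is the number of sodas ordered in multi-soda orders. If "how many times" is read as the number of such order rows, replacing `.select(f.sum("quantity"))` with `.count()` would answer that instead. A plain-Python sketch on hypothetical rows (made up for illustration, not taken from the dataset) shows how the two readings differ:

```python
# Hypothetical (item_name, quantity) rows, not taken from the dataset.
rows = [
    ("Canned Soda", 2),
    ("Canned Soda", 1),
    ("Canned Soda", 3),
    ("Izze", 2),
]

# Quantities of Canned Soda rows where more than one was ordered.
multi = [qty for name, qty in rows if name == "Canned Soda" and qty > 1]

total_sodas = sum(multi)   # analogous to select(f.sum("quantity"))
order_rows = len(multi)    # analogous to count()
print(total_sodas, order_rows)
```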

