# Ex1 - Filtering and Sorting Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     317.0/317.0 MB 1.4 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488491 sha256=45ed1420e3d7e76f80f2d4ef1c85bce79ca35c5a712326d5180e5ffd9f71fdba
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv).

In [1]:
!wget -O chipotle.tsv https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv

--2024-04-05 12:46:47--  https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 364975 (356K) [text/plain]
Saving to: ‘chipotle.tsv’


2024-04-05 12:46:48 (7.08 MB/s) - ‘chipotle.tsv’ saved [364975/364975]



### Step 3. Assign it to a variable called chipo.

In [18]:
chipo = spark.read.csv("chipotle.tsv", sep='\t', header=True, inferSchema=True)

### Step 4. How many products cost more than $10.00?

In [10]:
from pyspark.sql.types import IntegerType, FloatType
from pyspark.sql.functions import expr, col

In [19]:
# Drop the leading "$" (SQL substring is 1-based) and cast the remainder to a float
chipo = chipo.withColumn('item_price', expr("substring(item_price, 2, length(item_price))").cast(FloatType()))

In [20]:
# Keep one row per (item_name, quantity, choice_description) combination
chipo_filtered = chipo.drop_duplicates(['item_name', 'quantity', 'choice_description'])

In [26]:
chipo_filtered.filter((col('item_price')>10) & (col('quantity')==1)).distinct().show()

+--------+--------+------------------+--------------------+----------+
|order_id|quantity|         item_name|  choice_description|item_price|
+--------+--------+------------------+--------------------+----------+
|     519|       1|     Steak Burrito|[Fresh Tomato Sal...|     11.48|
|     635|       1|        Steak Bowl|[Tomatillo Green ...|     11.75|
|      61|       1|     Barbacoa Bowl|[Tomatillo Red Ch...|     11.75|
|     374|       1|  Barbacoa Burrito|[Fresh Tomato Sal...|     11.75|
|    1036|       1|      Chicken Bowl|[Fresh Tomato Sal...|     10.98|
|    1368|       1|Chicken Salad Bowl|[Fresh Tomato Sal...|     11.25|
|    1736|       1|  Barbacoa Burrito|[Tomatillo Red Ch...|     11.75|
|     468|       1|        Steak Bowl|[Tomatillo Green ...|     11.75|
|     552|       1|Chicken Salad Bowl|[Roasted Chili Co...|     11.25|
|    1087|       1|   Chicken Burrito|[Tomatillo Green ...|     11.25|
|    1812|       1|   Chicken Burrito|[Tomatillo Red Ch...|     11.25|
|     

### Step 5. What is the price of each item?
###### Print a DataFrame with only two columns: item_name and item_price

In [29]:
chipo.filter(col('quantity') == 1).select('item_name', 'item_price').distinct().show()

+--------------------+----------+
|           item_name|item_price|
+--------------------+----------+
|        Chicken Bowl|     10.98|
| Chips and Guacamole|      4.45|
|         Veggie Bowl|     11.25|
|      Veggie Burrito|     10.98|
|Chips and Roasted...|      2.95|
|       Chicken Salad|      8.19|
|    Barbacoa Burrito|     11.48|
|        Chicken Bowl|      8.19|
|          Steak Bowl|      8.69|
|   Veggie Soft Tacos|     11.25|
|Chicken Crispy Tacos|      8.49|
|         Steak Salad|      8.99|
|          Steak Bowl|     11.08|
| Chips and Guacamole|      3.89|
|         Steak Salad|      8.69|
|Barbacoa Crispy T...|     11.48|
|    Carnitas Burrito|      8.99|
| Barbacoa Salad Bowl|     11.89|
|  Steak Crispy Tacos|      8.99|
|    Barbacoa Burrito|      8.99|
+--------------------+----------+
only showing top 20 rows



### Step 6. Sort by the name of the item

In [30]:
chipo.orderBy('item_name').show()

+--------+--------+-----------------+------------------+----------+
|order_id|quantity|        item_name|choice_description|item_price|
+--------+--------+-----------------+------------------+----------+
|     511|       1|6 Pack Soft Drink|            [Coke]|      6.49|
|    1253|       1|6 Pack Soft Drink|        [Lemonade]|      6.49|
|     520|       1|6 Pack Soft Drink|          [Sprite]|      6.49|
|     148|       1|6 Pack Soft Drink|       [Diet Coke]|      6.49|
|     566|       1|6 Pack Soft Drink|       [Diet Coke]|      6.49|
|     168|       1|6 Pack Soft Drink|       [Diet Coke]|      6.49|
|     708|       1|6 Pack Soft Drink|            [Coke]|      6.49|
|     230|       1|6 Pack Soft Drink|       [Diet Coke]|      6.49|
|     709|       1|6 Pack Soft Drink|       [Diet Coke]|      6.49|
|     298|       1|6 Pack Soft Drink|          [Nestea]|      6.49|
|     749|       1|6 Pack Soft Drink|            [Coke]|      6.49|
|     363|       1|6 Pack Soft Drink|           

### Step 7. What was the quantity of the most expensive item ordered?

In [35]:
chipo.select(col('item_name'), col('quantity'), col('item_price')).orderBy(col('item_price'), ascending=False).show(truncate=False)

+----------------------------+--------+----------+
|item_name                   |quantity|item_price|
+----------------------------+--------+----------+
|Chips and Fresh Tomato Salsa|15      |44.25     |
|Carnitas Bowl               |3       |35.25     |
|Chicken Burrito             |4       |35.0      |
|Chicken Burrito             |4       |35.0      |
|Veggie Burrito              |3       |33.75     |
|Chicken Bowl                |3       |32.94     |
|Steak Burrito               |3       |27.75     |
|Steak Burrito               |3       |27.75     |
|Chicken Bowl                |3       |26.25     |
|Chicken Burrito             |3       |26.25     |
|Chicken Burrito             |3       |26.25     |
|Steak Bowl                  |3       |26.07     |
|Steak Salad Bowl            |2       |23.78     |
|Steak Salad Bowl            |2       |23.78     |
|Carnitas Bowl               |2       |23.5      |
|Steak Bowl                  |2       |23.5      |
|Steak Burrito               |2

### Step 8. How many times was a Veggie Salad Bowl ordered?

In [38]:
chipo.filter(col('item_name')=='Veggie Salad Bowl').count()

18

### Step 9. How many times did someone order more than one Canned Soda?

In [43]:
chipo.filter((col('item_name')=='Canned Soda') & (col('quantity')>1)).count()

20