# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [2]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=2bd4479bd77d33398bf099e6e257a6d9ce5527541da898eeb66d43bd57f28d33
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [15]:
from pyspark.sql import SparkSession
from pyspark import SparkFiles
import requests

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

In [19]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
spark = SparkSession.builder.appName("Exercise1").getOrCreate()
spark.sparkContext.addFile(url)

### Step 3. Assign it to a variable called chipo.

In [159]:
chipo = spark.read.csv("file://" + SparkFiles.get("chipotle.tsv"), sep="\t", header=True, inferSchema=True)
chipo.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)
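
The schema lists `item_price` as a string even though schema inference is enabled, because every value carries a `$` prefix and so cannot be read as a number. The plain-Python sketch below illustrates that idea in a simplified form. It is not Spark's actual inference code, and `infer_type` is a hypothetical helper introduced only for this illustration.

```python
def infer_type(values):
    """Return a Spark-like type name for a column of raw strings (simplified)."""
    def all_parse(cast):
        # True only if every value in the column converts without error.
        for value in values:
            try:
                cast(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "string"


# Sample values copied from the rows displayed in this notebook.
print(infer_type(["1", "2", "3"]))        # integer
print(infer_type(["$2.39 ", "$16.98 "]))  # string
```

Once the dollar sign is stripped, the same values would parse as numbers, which is what Step 13 does on the Spark side.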



### Step 4. See the first 10 entries

In [53]:
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+
only showing top 10 rows

### Step 5. What is the number of observations in the dataset?

In [33]:
# Solution 1
chipo.count()

4622

In [None]:
# Solution 2


### Step 6. What is the number of columns in the dataset?

In [34]:
len(chipo.columns)

5

### Step 7. Print the name of all the columns.

In [35]:
print(*chipo.columns)

order_id quantity item_name choice_description item_price


### Step 8. How is the dataset indexed?

A Spark DataFrame has no row index. Unlike a pandas DataFrame, its rows are distributed across partitions and carry no positional label, so row order is only defined after an explicit sort.

### Step 9. Which was the most-ordered item? 

In [98]:
import pyspark.sql.functions as f
topitem = chipo.groupBy("item_name").sum("quantity").orderBy(f.desc("sum(quantity)")).limit(1)
topitem.select("item_name").show()

+------------+
|   item_name|
+------------+
|Chicken Bowl|
+------------+
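
To make the group, sum, sort, and limit chain above more concrete, here is a minimal plain-Python sketch of the same logic. It runs on a handful of `(item_name, quantity)` pairs taken from the rows shown in Step 4, so its result describes only that sample and not the full dataset.

```python
from collections import Counter

# (item_name, quantity) pairs taken from the rows shown in Step 4.
rows = [
    ("Izze", 1),
    ("Nantucket Nectar", 1),
    ("Chicken Bowl", 2),
    ("Chicken Bowl", 1),
    ("Side of Chips", 1),
    ("Steak Burrito", 1),
    ("Steak Soft Tacos", 1),
    ("Steak Burrito", 1),
]

# Equivalent of groupBy("item_name").sum("quantity").
totals = Counter()
for name, quantity in rows:
    totals[name] += quantity

# Equivalent of orderBy(desc(...)).limit(1).
top_item, top_quantity = totals.most_common(1)[0]
print(top_item, top_quantity)  # Chicken Bowl 3
```

Summing `quantity` rather than counting rows matters here: a single row can represent more than one unit of an item.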



### Step 10. For the most-ordered item, how many items were ordered?

In [99]:
topitem.show()

+------------+-------------+
|   item_name|sum(quantity)|
+------------+-------------+
|Chicken Bowl|          761|
+------------+-------------+



### Step 11. What was the most ordered item in the choice_description column?

In [136]:
chipo.groupBy("choice_description").count().filter(f.col("choice_description") != "NULL").orderBy(f.desc("count")).limit(1).show()


+------------------+-----+
|choice_description|count|
+------------------+-----+
|       [Diet Coke]|  134|
+------------------+-----+



### Step 12. How many items were ordered in total?

In [137]:
# count() returns the number of rows; summing quantity gives the number of items.
chipo.agg(f.sum("quantity")).collect()[0][0]

4972

### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [139]:
chipo.select("item_price")

DataFrame[item_price: string]

#### Step 13.b. Create a lambda function and change the type of item price

In [160]:
# Strip the dollar sign (raw string avoids an invalid escape) and cast to double.
chipo = chipo.withColumn("item_price", f.regexp_replace(f.col("item_price"), r"\$", "").cast("double"))
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|      2.39|
|       1|       1|                Izze|        [Clementine]|      3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|      3.39|
|       1|       1|Chips and Tomatil...|                NULL|      2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|     16.98|
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|     10.98|
|       3|       1|       Side of Chips|                NULL|      1.69|
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|     11.75|
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|      9.25|
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|      9.25|
+--------+--------+--------------------+--------------------+----------+
only showing top 10 rows
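
The step title asks for a lambda function, while the cell above relies on Spark's built-in column functions. For comparison, a minimal plain-Python sketch of the lambda approach might look like the following. The name `to_float` is hypothetical, and the sample prices are copied from the rows shown in Step 4.

```python
# Hypothetical helper: drop the dollar sign and surrounding whitespace, then convert.
to_float = lambda price: float(price.replace("$", "").strip())

# Raw price strings as they appear before the conversion.
prices = ["$2.39 ", "$16.98 ", "$10.98 "]
print([to_float(p) for p in prices])  # [2.39, 16.98, 10.98]
```

A lambda like this could be wrapped in a Spark UDF, but built-in functions such as `regexp_replace` are usually preferable because they avoid passing each row through the Python interpreter.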

#### Step 13.c. Check the item price type

In [161]:
chipo.select("item_price")

DataFrame[item_price: double]

### Step 14. How much was the revenue for the period in the dataset?

In [169]:
chipo.select((chipo.quantity * chipo.item_price).alias("cost")).select(f.sum("cost")).show()

+------------------+
|         sum(cost)|
+------------------+
|39237.020000000055|
+------------------+
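
The trailing digits in `39237.020000000055` are a binary floating-point artifact rather than part of the revenue. The sketch below shows the effect in plain Python on five prices copied from the first rows of the dataset, and two common ways to get a clean figure: rounding the float total, or summing with `Decimal`. On the Spark side, wrapping the sum in `f.round(..., 2)` should serve the same purpose.

```python
from decimal import Decimal

# Prices copied from the first five rows shown in this notebook.
prices = ["2.39", "3.39", "3.39", "2.39", "16.98"]

# Floats accumulate binary rounding error as values are added.
float_total = sum(float(p) for p in prices)

# Decimal keeps exact base-10 arithmetic for currency-like values.
exact_total = sum(Decimal(p) for p in prices)

print(float_total)            # may show trailing rounding digits
print(round(float_total, 2))  # 28.54
print(exact_total)            # 28.54
```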



### Step 15. How many orders were made in the period?

In [175]:
chipo.select("order_id").distinct().count()

1834

### Step 16. What is the average revenue amount per order?

In [186]:
# Solution 1
chipo.select(chipo.order_id, (chipo.quantity * chipo.item_price).alias("cost"))\
  .groupBy("order_id")\
  .sum("cost")\
  .agg({"sum(cost)":"avg"})\
  .show()

+------------------+
|    avg(sum(cost))|
+------------------+
|21.394231188658736|
+------------------+
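
The chain above first computes one total per order and then averages those totals. A minimal plain-Python sketch of the same two-stage aggregation follows, using only the first four orders visible in Step 13.b, so the figure it prints applies to that sample and not to the full dataset.

```python
from collections import defaultdict
from decimal import Decimal

# (order_id, quantity, item_price) for the first four orders shown in Step 13.b.
rows = [
    (1, 1, "2.39"), (1, 1, "3.39"), (1, 1, "3.39"), (1, 1, "2.39"),
    (2, 2, "16.98"),
    (3, 1, "10.98"), (3, 1, "1.69"),
    (4, 1, "11.75"), (4, 1, "9.25"),
]

# Stage 1: total per order, the equivalent of groupBy("order_id").sum("cost").
order_totals = defaultdict(Decimal)
for order_id, quantity, price in rows:
    order_totals[order_id] += quantity * Decimal(price)

# Stage 2: mean of the per-order totals.
average = sum(order_totals.values()) / len(order_totals)
print(average)  # 19.7975
```

This follows the notebook's convention of multiplying `quantity` by `item_price`. The row for order 2 (quantity 2, price 16.98) hints that `item_price` might already be a line total; if so, the multiplication would overcount, so the convention is worth checking against the data source.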



In [None]:
# Solution 2



### Step 17. How many different items are sold?

In [190]:
chipo.select("item_name").distinct().count()

50