# Ex1 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

In [0]:
# Copy file from GitHub into DBFS
dbutils.fs.cp(
    "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv",
    "dbfs:/FileStore/tables/chipotle.tsv"
)

Out[1]: True

In [0]:
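# DBFS path of the file copied in the previous cell
path = "dbfs:/FileStore/tables/chipotle.tsv"

# Read the tab-separated file, treating the first row as the header and inferring column types
df = (
    spark.read.format("csv")
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", "\t")
    .load(path)
)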

display(df)
df.printSchema()


order_id,quantity,item_name,choice_description,item_price
1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,Izze,[Clementine],$3.39
1,1,Nantucket Nectar,[Apple],$3.39
1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]",$16.98
3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]",$10.98
3,1,Side of Chips,,$1.69
4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]",$11.75
4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]]",$9.25
5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]",$9.25


root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: string (nullable = true)



### Step 3. Assign it to a variable called chipo.

In [0]:
chipo = df

### Step 4. See the first 10 entries

In [0]:
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+

### Step 5. What is the number of observations in the dataset?

In [0]:
# count() is an action: it runs a job and returns the number of rows
chipo.count()

Out[12]: 4622

### Step 6. What is the number of columns in the dataset?

In [0]:
len(chipo.columns)

Out[14]: 5

### Step 7. Print the name of all the columns.

In [0]:
chipo.columns

Out[15]: ['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

### Step 8. How is the dataset indexed?

In [0]:
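# Spark DataFrames have no built-in row index, so we add an ID column instead.
# monotonically_increasing_id() gives unique, increasing IDs that are consecutive only within a partition.
# Note: the wildcard import rebinds Python built-ins such as sum() to Spark's column functions.
from pyspark.sql.functions import *
df_indexed = df.withColumn("row_index", monotonically_increasing_id())
display(df_indexed)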

order_id,quantity,item_name,choice_description,item_price,row_index
1,1,Chips and Fresh Tomato Salsa,,$2.39,0
1,1,Izze,[Clementine],$3.39,1
1,1,Nantucket Nectar,[Apple],$3.39,2
1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39,3
2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]",$16.98,4
3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sour Cream, Guacamole, Lettuce]]",$10.98,5
3,1,Side of Chips,,$1.69,6
4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables, Black Beans, Pinto Beans, Cheese, Sour Cream, Guacamole, Lettuce]]",$11.75,7
4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Cheese, Sour Cream, Lettuce]]",$9.25,8
5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto Beans, Cheese, Sour Cream, Lettuce]]",$9.25,9


### Step 9. Which was the most-ordered item? 

In [0]:
most_ordered = (
    df.groupBy("item_name")
    .agg(sum("quantity").alias("Total_Ordered"))
    .orderBy(col("Total_Ordered").desc())
)
most_ordered.show(1)

+------------+-------------+
|   item_name|Total_Ordered|
+------------+-------------+
|Chicken Bowl|          761|
+------------+-------------+
only showing top 1 row



### Step 10. What was the most ordered item in the choice_description column?

In [0]:
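# Rows without a choice_description are grouped under null, which appears as the blank label in the output
most_ordered_df = (
    df.groupBy("choice_description")
    .agg(sum(col("quantity")).alias("Total_Ordered"))
    .orderBy(col("Total_Ordered").desc())
    .limit(1)
)
display(most_ordered_df)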

choice_description,Total_Ordered
,1382


### Step 11. How many items were ordered in total?

In [0]:
df.agg(sum("quantity")).display()

sum(quantity)
4972


### Step 12. Turn the item price into a float

In [0]:
# Strip the "$" sign, then cast the remaining text to a double
df_clean = df.withColumn(
    "item_price",
    regexp_replace(col("item_price"), "[$]", "").cast("double"),
)
df_clean.printSchema()
df_clean.show(5)


root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: double (nullable = true)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|      2.39|
|       1|       1|                Izze|        [Clementine]|      3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|      3.39|
|       1|       1|Chips and Tomatil...|                NULL|      2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|     16.98|
+--------+--------+--------------------+--------------------+----------+
only showing top 5 rows



### Step 13. Check the item price type

In [0]:
df_clean.select("item_price").dtypes

Out[39]: [('item_price', 'double')]

### Step 14. How much was the revenue for the period in the dataset?

In [0]:
df_clean.withColumn("Revenue", col("quantity") * col("item_price")) \
    .agg(sum("Revenue").alias("Total Revenue")).display()

Total Revenue
39237.020000000055
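
The trailing digits in the total above come from binary floating-point rounding, not from the data: prices such as 2.39 have no exact `double` representation, and the small errors accumulate across rows. The plain-Python sketch below illustrates the effect with the ten rows shown in Step 4, applying the same quantity times price formula; it is an illustration of the arithmetic, not a replacement for the Spark query.

```python
import builtins
import math
from decimal import Decimal

# (quantity, item_price) for the ten rows shown in Step 4
rows = [
    (1, "2.39"), (1, "3.39"), (1, "3.39"), (1, "2.39"), (2, "16.98"),
    (1, "10.98"), (1, "1.69"), (1, "11.75"), (1, "9.25"), (1, "9.25"),
]

# builtins.sum is named explicitly because the notebook's wildcard import
# rebinds the bare name sum to Spark's aggregate function.

# Binary floats: each product and partial sum may carry a tiny rounding error
float_total = builtins.sum(qty * float(price) for qty, price in rows)

# Decimal arithmetic keeps the cents exact
exact_total = builtins.sum(qty * Decimal(price) for qty, price in rows)

print(float_total, exact_total)
```

In Spark, two common ways to get a clean figure are casting `item_price` to `"decimal(10,2)"` instead of `"double"`, or wrapping the final sum in `round(..., 2)`; which one fits depends on whether exact arithmetic or only tidy display is needed.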


### Step 15. How many orders were made in the period?

In [0]:
df.select(countDistinct("order_id").alias("Total Orders")).display()

Total Orders
1834


### Step 16. What is the average revenue amount per order?

In [0]:
# Sum revenue within each order, then average those sums across orders
order_revenue = df_clean.groupBy("order_id").agg(
    sum(col("item_price") * col("quantity")).alias("order_revenue")
)
order_revenue.agg(avg("order_revenue").alias("Avg_Order_Value")).display()

Avg_Order_Value
21.394231188658736
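
Because every row belongs to exactly one order, the mean of the per-order sums must equal the total revenue divided by the number of orders. That gives a quick sanity check using only the figures already computed in Steps 11, 14 and 15; the variable names below are illustrative.

```python
total_revenue = 39237.02  # Step 14 result, rounded to cents
total_orders = 1834       # Step 15 result
total_items = 4972        # Step 11 result

# Mean of per-order sums == grand total / number of orders
avg_order_value = total_revenue / total_orders

# A related figure: average number of items per order
items_per_order = total_items / total_orders

print(avg_order_value, items_per_order)
```

The first value agrees with `Avg_Order_Value` above up to floating-point noise, which confirms that the grouped aggregation did not drop or duplicate any orders.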


### Step 17. How many different items are sold?

In [0]:
df.select(countDistinct("item_name")).display()

count(DISTINCT item_name)
50
