# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [127]:
from pyspark.sql import SparkSession
import requests
from pyspark.sql.functions import asc, desc, col, udf
from pyspark.sql.types import (
    StringType, BooleanType, IntegerType, FloatType, DateType, DoubleType
)

In [5]:
# Create SparkSession 
spark = SparkSession.builder \
      .master("local[1]") \
      .appName("basics") \
      .getOrCreate() 


### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

In [6]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"

In [15]:
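r = requests.get(url, timeout=30)
r.raise_for_status()  # fail early on HTTP errors instead of saving an error page
with open("chipotle.tsv", "w", encoding="utf-8") as f:
    f.write(r.text)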

### Step 3. Assign it to a variable called chipo.

In [31]:
# The file is tab-separated, so the delimiter must be "\t" rather than ",".
chipo = spark.read.options(header=True, inferSchema=True, delimiter="\t") \
  .csv("chipotle.tsv")

### Step 4. See the first 10 entries

In [33]:
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+
only showing top 10 rows

### Step 5. What is the number of observations in the dataset?

In [37]:
# Solution 1
chipo.count()

4622

In [39]:
# Solution 2
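Solution 2 is left blank above. One possible alternative, sketched here as an assumption rather than the intended answer, is to count the data rows of the downloaded file without Spark, which doubles as a cross-check on `chipo.count()`. The helper name `count_tsv_rows` is hypothetical, and the result only matches Spark's count if both parsers treat quoting the same way.

```python
import csv
from typing import Iterable


def count_tsv_rows(lines: Iterable[str], has_header: bool = True) -> int:
    """Count non-empty rows of tab-separated text, optionally skipping a header row."""
    rows = sum(1 for row in csv.reader(lines, delimiter="\t") if row)
    return max(rows - 1, 0) if has_header else rows


# Usage (assumes chipotle.tsv was written by Step 2):
# with open("chipotle.tsv", newline="", encoding="utf-8") as f:
#     print(count_tsv_rows(f))
```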

### Step 6. What is the number of columns in the dataset?

In [40]:
len(chipo.columns)

5

### Step 7. Print the name of all the columns.

In [41]:
print(chipo.columns)

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']


### Step 8. How is the dataset indexed?

In [42]:
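# Spark DataFrames have no row index (unlike pandas): rows are distributed
# across partitions and have no guaranteed order. If a row identifier is
# needed, one can be added, e.g. with monotonically_increasing_id().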

### Step 9. Which was the most-ordered item? 

In [59]:
df1 = chipo.select("item_name").groupby('item_name').count().sort(desc("count"))
df1.show(1)

+------------+-----+
|   item_name|count|
+------------+-----+
|Chicken Bowl|  726|
+------------+-----+
only showing top 1 row



### Step 10. For the most-ordered item, how many items were ordered?

In [70]:
df1.filter("item_name = 'Chicken Bowl'").select("count").show()

+-----+
|count|
+-----+
|  726|
+-----+
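
Note that `count()` counts the order lines that mention an item, not the units sold: a line with `quantity` 2 still counts once. If "most ordered" is meant in units, summing `quantity` per item is the usual approach. The sketch below shows that aggregation; the function names are hypothetical, and the PySpark variant is assumed to mirror the plain-Python one on the same rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def units_per_item(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum quantity per item name from (item_name, quantity) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    for item_name, quantity in rows:
        totals[item_name] += quantity
    return dict(totals)


def most_ordered(rows: Iterable[Tuple[str, int]]) -> Tuple[str, int]:
    """Return the (item_name, units) pair with the largest total; raises on empty input."""
    totals = units_per_item(rows)
    return max(totals.items(), key=lambda kv: kv[1])


def units_per_item_spark(df):
    """PySpark variant; pyspark is imported lazily so the helpers above work without it."""
    from pyspark.sql import functions as F
    return (df.groupBy("item_name")
              .agg(F.sum("quantity").alias("units"))
              .orderBy(F.desc("units")))


# A few (item_name, quantity) pairs taken from the rows shown in Step 4
sample = [("Chicken Bowl", 2), ("Chicken Bowl", 1), ("Side of Chips", 1), ("Izze", 1)]
print(most_ordered(sample))  # ('Chicken Bowl', 3)
```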



### Step 11. What was the most ordered item in the choice_description column?

In [74]:
chipo.select("item_name", "choice_description").show(10)

+--------------------+--------------------+
|           item_name|  choice_description|
+--------------------+--------------------+
|Chips and Fresh T...|                NULL|
|                Izze|        [Clementine]|
|    Nantucket Nectar|             [Apple]|
|Chips and Tomatil...|                NULL|
|        Chicken Bowl|[Tomatillo-Red Ch...|
|        Chicken Bowl|[Fresh Tomato Sal...|
|       Side of Chips|                NULL|
|       Steak Burrito|[Tomatillo Red Ch...|
|    Steak Soft Tacos|[Tomatillo Green ...|
|       Steak Burrito|[Fresh Tomato Sal...|
+--------------------+--------------------+
only showing top 10 rows



In [103]:
s = (
    chipo.filter("choice_description != 'NULL'")  # missing choices appear as the literal string 'NULL'
    .select("choice_description")
    .groupby("choice_description")
    .count()
    .sort(desc("count"))
)
s.show(1)

+------------------+-----+
|choice_description|count|
+------------------+-----+
|       [Diet Coke]|  134|
+------------------+-----+
only showing top 1 row



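### Step 12. How many items were ordered in total?

In [112]:
# Counts order lines (rows); a line with quantity 2 still counts once.
# To count units instead, sum the quantity column: chipo.agg({"quantity": "sum"}).
chipo.select("order_id").count()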

4622

### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [119]:
chipo.select('item_price').dtypes

[('item_price', 'string')]

#### Step 13.b. Create a lambda function and change the type of item price

In [167]:
chipo.dtypes

[('order_id', 'int'),
 ('quantity', 'int'),
 ('item_name', 'string'),
 ('choice_description', 'string'),
 ('item_price', 'string')]

In [175]:
def strip_dollars(s: str):
    amount = float(s.replace('$', ''))
    return amount
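
`strip_dollars` assumes every value is a well-formed string such as `"$2.39 "`. A slightly more defensive variant, sketched below, returns `None` for missing or malformed values so that a UDF built from it would yield null instead of raising. The names `parse_price` and `add_price_column` are hypothetical, and removing thousands separators is an assumption about how larger prices might be written.

```python
from typing import Optional


def parse_price(raw: Optional[str]) -> Optional[float]:
    """Parse strings like '$2.39 ' into a float; return None if that is not possible."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def add_price_column(df, source: str = "item_price", target: str = "price"):
    """Attach a parsed numeric column using a UDF; pyspark is imported lazily."""
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import DoubleType
    parse_udf = udf(parse_price, DoubleType())
    return df.withColumn(target, parse_udf(col(source)))


print(parse_price("$16.98 "))  # 16.98
```

A Python UDF is convenient here but generally slower than built-in column functions, so for large data a built-in string replacement followed by a cast is usually preferable.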

In [198]:
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                NULL|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                NULL|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+--------------------+----------+
only showing top 10 rows

In [None]:
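def convertType(chipo):
    # Collect the column to the driver once and parse every value.
    # Fine for this small dataset; avoid collect() on large DataFrames.
    rows = chipo.select("item_price").collect()
    return [float(row[0].replace('$', '')) for row in rows]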

In [181]:
# withColumn needs a Column expression, so wrap strip_dollars as a UDF.
df3 = chipo.withColumn('price', udf(strip_dollars, DoubleType())(col('item_price')))

In [169]:
# A direct cast of "$2.39 " to double yields null, so strip the "$" and
# surrounding whitespace before casting.
from pyspark.sql.functions import regexp_replace, trim

df2 = chipo.withColumn('order_id', col('order_id').cast(IntegerType())) \
            .withColumn('quantity', col('quantity').cast(IntegerType())) \
            .withColumn('item_name', col('item_name').cast(StringType())) \
            .withColumn('choice_description', col('choice_description').cast(StringType())) \
            .withColumn('item_price', trim(regexp_replace(col('item_price'), r'\$', '')).cast(DoubleType()))

#### Step 13.c. Check the item price type

In [174]:
df2.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                NULL|      2.39|
|       1|       1|                Izze|        [Clementine]|      3.39|
|       1|       1|    Nantucket Nectar|             [Apple]|      3.39|
|       1|       1|Chips and Tomatil...|                NULL|      2.39|
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|     16.98|
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|     10.98|
|       3|       1|       Side of Chips|                NULL|      1.69|
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|     11.75|
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|      9.25|
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|      9.25|
+--------+--------+--------------------+--------------------+----------+
only showing top 10 rows

In [171]:
df2.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- quantity: integer (nullable = true)
 |-- item_name: string (nullable = true)
 |-- choice_description: string (nullable = true)
 |-- item_price: double (nullable = true)

### Step 14. How much was the revenue for the period in the dataset?

In [173]:
df2.select("item_price").show(10)

+----------+
|item_price|
+----------+
|      2.39|
|      3.39|
|      3.39|
|      2.39|
|     16.98|
|     10.98|
|      1.69|
|     11.75|
|      9.25|
|      9.25|
+----------+
only showing top 10 rows
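
Displaying the column does not yet answer the question: revenue is the sum of a numeric price column, and a column of nulls sums to null. The sketch below assumes `item_price` has been parsed to a number and that it already holds the line total (the 2-unit Chicken Bowl line at $16.98 suggests so, but this is an assumption); if it were a unit price, each value would need multiplying by `quantity` first. The function names are hypothetical.

```python
from typing import Iterable, Optional


def total_revenue(line_prices: Iterable[Optional[float]]) -> float:
    """Sum numeric line prices, skipping missing (None) values."""
    return sum(p for p in line_prices if p is not None)


def total_revenue_spark(df, price_col: str = "item_price"):
    """PySpark variant; expects price_col to be numeric. pyspark is imported lazily."""
    from pyspark.sql import functions as F
    return df.agg(F.sum(price_col).alias("revenue")).first()["revenue"]


# Prices of the ten rows shown in Step 4, not the full dataset
sample_prices = [2.39, 3.39, 3.39, 2.39, 16.98, 10.98, 1.69, 11.75, 9.25, 9.25]
print(round(total_revenue(sample_prices), 2))  # 71.46
```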



### Step 15. How many orders were made in the period?
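
An order can span several rows (order 1 in Step 4 has four lines), so the number of orders is the number of distinct `order_id` values rather than the row count. A minimal sketch, with hypothetical function names:

```python
from typing import Iterable


def count_orders(order_ids: Iterable[int]) -> int:
    """Number of distinct order ids."""
    return len(set(order_ids))


def count_orders_spark(df) -> int:
    """PySpark variant: distinct order_id values in a DataFrame such as chipo."""
    return df.select("order_id").distinct().count()


# order_id values of the ten rows shown in Step 4
sample_ids = [1, 1, 1, 1, 2, 3, 3, 4, 4, 5]
print(count_orders(sample_ids))  # 5
```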

### Step 16. What is the average revenue amount per order?

In [3]:
# Solution 1
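
One way to approach this, sketched under the assumption that `item_price` is numeric and already represents each line's total: sum the prices per `order_id`, then average those per-order totals. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_revenue_per_order(rows: Iterable[Tuple[int, float]]) -> float:
    """Average of per-order totals from (order_id, line_price) pairs.

    ASSUMES line_price is the total for the line, not a unit price.
    """
    per_order: Dict[int, float] = defaultdict(float)
    for order_id, line_price in rows:
        per_order[order_id] += line_price
    if not per_order:
        raise ValueError("no rows to average")
    return sum(per_order.values()) / len(per_order)


def average_revenue_per_order_spark(df):
    """PySpark variant; expects a numeric item_price column. pyspark is imported lazily."""
    from pyspark.sql import functions as F
    per_order = df.groupBy("order_id").agg(F.sum("item_price").alias("order_total"))
    return per_order.agg(F.avg("order_total").alias("avg_total")).first()["avg_total"]


# (order_id, price) pairs of the ten rows shown in Step 4, not the full dataset
sample = [(1, 2.39), (1, 3.39), (1, 3.39), (1, 2.39), (2, 16.98),
          (3, 10.98), (3, 1.69), (4, 11.75), (4, 9.25), (5, 9.25)]
print(round(average_revenue_per_order(sample), 2))  # 14.29
```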



In [4]:
# Solution 2



### Step 17. How many different items are sold?
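
This step has no solution yet. A possible sketch: collect the distinct `item_name` values and count them. Stripping surrounding whitespace is an assumption about how names should be compared, and the function names are hypothetical.

```python
from typing import Iterable, List, Optional


def distinct_items(item_names: Iterable[Optional[str]]) -> List[str]:
    """Sorted distinct item names, ignoring surrounding whitespace and missing values."""
    return sorted({name.strip() for name in item_names if name is not None})


def count_distinct_items_spark(df) -> int:
    """PySpark variant: number of distinct item_name values (no whitespace stripping)."""
    return df.select("item_name").distinct().count()


# A few item names from the rows shown in Step 4
sample_names = ["Izze", "Chicken Bowl", "Chicken Bowl", "Side of Chips",
                "Steak Burrito", "Steak Burrito"]
items = distinct_items(sample_names)
print(len(items), items)  # 4 ['Chicken Bowl', 'Izze', 'Side of Chips', 'Steak Burrito']
```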