# Ex2 - Getting and Knowing your Data

This time we are going to pull data directly from the internet.
Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("My Application").getOrCreate()



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv). 

### Step 3. Assign it to a variable called chipo.

In [5]:
from pyspark.sql.functions import *
from pyspark.sql.types import *

import pandas as pd

url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
pandas_df = pd.read_csv(url, sep="\t")

chipo = spark.createDataFrame(pandas_df)

chipo.show()


                                                                                

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                 NaN|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                 NaN|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                 NaN|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
|       5|       1| Chips and Guacamole|           

### Step 4. See the first 10 entries

In [6]:
chipo.show(10)

+--------+--------+--------------------+--------------------+----------+
|order_id|quantity|           item_name|  choice_description|item_price|
+--------+--------+--------------------+--------------------+----------+
|       1|       1|Chips and Fresh T...|                 NaN|    $2.39 |
|       1|       1|                Izze|        [Clementine]|    $3.39 |
|       1|       1|    Nantucket Nectar|             [Apple]|    $3.39 |
|       1|       1|Chips and Tomatil...|                 NaN|    $2.39 |
|       2|       2|        Chicken Bowl|[Tomatillo-Red Ch...|   $16.98 |
|       3|       1|        Chicken Bowl|[Fresh Tomato Sal...|   $10.98 |
|       3|       1|       Side of Chips|                 NaN|    $1.69 |
|       4|       1|       Steak Burrito|[Tomatillo Red Ch...|   $11.75 |
|       4|       1|    Steak Soft Tacos|[Tomatillo Green ...|    $9.25 |
|       5|       1|       Steak Burrito|[Fresh Tomato Sal...|    $9.25 |
+--------+--------+--------------------+-----------

### Step 5. What is the number of observations in the dataset?

In [7]:
# Solution 1

chipo.count()

4622

In [8]:
# Solution 2

chipo.createOrReplaceTempView("chipo_tbl")

spark.sql(
    """
    select count(*)
    from chipo_tbl
    """
).show()

+--------+
|count(1)|
+--------+
|    4622|
+--------+



### Step 6. What is the number of columns in the dataset?

In [9]:
len(chipo.columns)

5

### Step 7. Print the name of all the columns.

In [10]:
chipo.columns

['order_id', 'quantity', 'item_name', 'choice_description', 'item_price']

### Step 8. How is the dataset indexed?
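
Unlike a pandas DataFrame, a Spark DataFrame carries no row index: rows are spread across partitions and have no inherent position. If a row identifier is needed, one option is to add it explicitly. A minimal sketch, where `with_row_id` and the column name `row_id` are hypothetical names introduced here rather than part of the exercise:

```python
def with_row_id(df, name="row_id"):
    """Return df with an extra column of unique 64-bit ids.

    The ids are increasing and unique, but not consecutive, so they are
    an identifier rather than a positional index.
    """
    # Imported lazily so that defining the helper does not need a Spark install.
    from pyspark.sql.functions import monotonically_increasing_id

    return df.withColumn(name, monotonically_increasing_id())
```

With the notebook's DataFrame this would be called as `with_row_id(chipo).show(5)`; `chipo.printSchema()` shows the column names and types that take the place of an index description.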

### Step 9. Which was the most-ordered item? 

In [11]:

chipo.groupBy("item_name").agg(count("item_name").alias("countt")).orderBy(desc("countt")).show(1)

+------------+------+
|   item_name|countt|
+------------+------+
|Chicken Bowl|   726|
+------------+------+
only showing top 1 row



### Step 10. For the most-ordered item, how many items were ordered?

In [12]:
chipo.groupBy("item_name").agg(count("item_name").alias("countt"), sum("quantity")).orderBy(desc("countt")).show(1)

+------------+------+-------------+
|   item_name|countt|sum(quantity)|
+------------+------+-------------+
|Chicken Bowl|   726|          761|
+------------+------+-------------+
only showing top 1 row



### Step 11. What was the most ordered item in the choice_description column?

In [13]:
chipo.groupBy('choice_description').agg(count("choice_description").alias("description")).orderBy(desc('description')).show(1)

+------------------+-----------+
|choice_description|description|
+------------------+-----------+
|               NaN|       1246|
+------------------+-----------+
only showing top 1 row
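
The top row above is the missing-value marker rather than a real choice, so it hides the answer. A minimal sketch that drops missing descriptions before counting, assuming (as the output above suggests) that the pandas missing values arrived in Spark as the literal string `NaN`; `top_choice` and the `n_rows` alias are names introduced here, not part of the notebook:

```python
def top_choice(df, n=1):
    """Return the n most frequent non-missing choice_description values."""
    # Imported lazily so that defining the helper does not need a Spark install.
    from pyspark.sql.functions import col, count, desc

    # Assumption: missing values show up either as nulls or as the string "NaN".
    present = col("choice_description").isNotNull() & (
        col("choice_description") != "NaN"
    )
    return (
        df.filter(present)
        .groupBy("choice_description")
        .agg(count("*").alias("n_rows"))
        .orderBy(desc("n_rows"))
        .limit(n)
    )
```

Calling `top_choice(chipo).show(truncate=False)` would print the full description instead of a truncated one. Note that this counts rows; summing `quantity` instead would rank by units ordered.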



### Step 12. How many items were ordered in total?

In [14]:
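# Total units ordered: sum the quantity column
# (a distinct count of item_name would give the number of menu items instead)
chipo.agg(sum("quantity")).show()

+-------------+
|sum(quantity)|
+-------------+
|         4972|
+-------------+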

### Step 13. Turn the item price into a float

#### Step 13.a. Check the item price type

In [15]:
chipo.select("item_price").dtypes

[('item_price', 'string')]

#### Step 13.b. Create a lambda function and change the type of item price

In [16]:
from pyspark.sql.functions import col, regexp_replace

chipo_clean = chipo.withColumn("item_price_clean", regexp_replace("item_price", "[$]", ""))

chipo_clean = chipo_clean.withColumn("item_price_float", col("item_price_clean").cast("float"))

chipo_clean.select("item_price", "item_price_clean", "item_price_float").show(5)


+----------+----------------+----------------+
|item_price|item_price_clean|item_price_float|
+----------+----------------+----------------+
|    $2.39 |           2.39 |            2.39|
|    $3.39 |           3.39 |            3.39|
|    $3.39 |           3.39 |            3.39|
|    $2.39 |           2.39 |            2.39|
|   $16.98 |          16.98 |           16.98|
+----------+----------------+----------------+
only showing top 5 rows
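
The step title asks for a lambda-style function, while the cell above uses the built-in `regexp_replace` and `cast`. For comparison, here is a sketch of the function-based route. `parse_price` and `price_udf` are hypothetical helpers introduced for illustration; built-in column functions are usually the faster choice in Spark, so this is an alternative rather than a recommendation.

```python
def parse_price(text):
    """Turn a price string such as '$2.39 ' into a float; pass None through."""
    if text is None:
        return None
    return float(text.replace("$", "").strip())


def price_udf():
    """Wrap parse_price as a Spark UDF that returns a double."""
    # Imported lazily so parse_price can be tried without a Spark install.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    return udf(parse_price, DoubleType())


print(parse_price("$16.98 "))  # prints 16.98
```

In the notebook it could be applied as `chipo.withColumn("item_price_float", price_udf()(col("item_price")))`.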



#### Step 13.c. Check the item price type

In [17]:
chipo_clean.select("item_price_float").dtypes

[('item_price_float', 'float')]

### Step 14. How much was the revenue for the period in the dataset?

In [23]:
chipo_clean.agg(round(sum(col("item_price_clean")), 2)).show()

+-------------------------------+
|round(sum(item_price_clean), 2)|
+-------------------------------+
|                       34500.16|
+-------------------------------+



### Step 15. How many orders were made in the period?

In [24]:
# An order spans several rows, so count the distinct order_id values
# (summing quantity would give the number of items, not orders)
chipo.select("order_id").distinct().count()

1834



### Step 16. What is the average revenue amount per order?

In [26]:
# Solution 1

chipo_clean.agg(round(avg("item_price_clean"),2)).show()

+-------------------------------+
|round(avg(item_price_clean), 2)|
+-------------------------------+
|                           7.46|
+-------------------------------+



In [28]:
# Solution 2
chipo_clean.createOrReplaceTempView('chipo_table')

spark.sql(
    """
    select avg(item_price_clean)
    from chipo_table
    """
).show()

+---------------------+
|avg(item_price_clean)|
+---------------------+
|    7.464335785374397|
+---------------------+
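
Both solutions above average over rows, and each row is one line item, so the 7.46 figure is the mean price per line rather than per order. To make the difference concrete, here is a small plain-Python check on orders 1 to 3, using the values shown in Step 4, followed by a sketch of the grouped Spark query. `avg_revenue_per_order` and the `order_total` alias are names introduced here, and the sketch assumes the `item_price_float` column of `chipo_clean`.

```python
import builtins  # the notebook's star import shadows the built-in sum and round

# (order_id, quantity, item_price) for orders 1-3, copied from the Step 4 output
rows = [
    (1, 1, 2.39), (1, 1, 3.39), (1, 1, 3.39), (1, 1, 2.39),
    (2, 2, 16.98),
    (3, 1, 10.98), (3, 1, 1.69),
]

n_rows = len(rows)                                   # line items
n_items = builtins.sum(q for _, q, _ in rows)        # units ordered
n_orders = len({order_id for order_id, _, _ in rows})  # distinct orders
revenue = builtins.sum(price for _, _, price in rows)

avg_per_row = builtins.round(revenue / n_rows, 2)      # 5.89
avg_per_order = builtins.round(revenue / n_orders, 2)  # 13.74


def avg_revenue_per_order(df, price_col="item_price_float"):
    """Sum prices within each order, then average those order totals."""
    # Imported lazily so the check above runs without a Spark install.
    from pyspark.sql.functions import avg, sum as sum_

    per_order = df.groupBy("order_id").agg(sum_(price_col).alias("order_total"))
    return per_order.agg(avg("order_total").alias("avg_order_total"))
```

In the notebook, `avg_revenue_per_order(chipo_clean).show()` would run the grouped version on the full dataset.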



### Step 17. How many different items are sold?

In [33]:
chipo.groupBy("item_name").agg(count("item_name").alias("count")).orderBy(desc("count")).show()

+--------------------+-----+
|           item_name|count|
+--------------------+-----+
|        Chicken Bowl|  726|
|     Chicken Burrito|  553|
| Chips and Guacamole|  479|
|       Steak Burrito|  368|
|   Canned Soft Drink|  301|
|               Chips|  211|
|          Steak Bowl|  211|
|       Bottled Water|  162|
|  Chicken Soft Tacos|  115|
|  Chicken Salad Bowl|  110|
|Chips and Fresh T...|  110|
|         Canned Soda|  104|
|       Side of Chips|  101|
|      Veggie Burrito|   95|
|    Barbacoa Burrito|   91|
|         Veggie Bowl|   85|
|       Carnitas Bowl|   68|
|       Barbacoa Bowl|   66|
|    Carnitas Burrito|   59|
|    Steak Soft Tacos|   55|
+--------------------+-----+
only showing top 20 rows
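
The table lists how often each item appears, truncated to 20 rows, which does not directly answer how many different items exist. A minimal sketch that returns the number itself; `n_distinct_items` is a hypothetical helper name, and it relies only on the DataFrame methods `select`, `distinct`, and `count`:

```python
def n_distinct_items(df, column="item_name"):
    """Count the distinct values in one column of a Spark DataFrame."""
    return df.select(column).distinct().count()
```

In the notebook this would be called as `n_distinct_items(chipo)`.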

