### This notebook compares pandas (Python) and PySpark on the same tasks

#### Data Path For Python

In [20]:
orders_path_pd = "/home/labuser/Downloads/orders.csv"
customers_path_pd = "/home/labuser/Downloads/customers.csv"
restaurant_path_pd = "/home/labuser/Downloads/restaurants.csv"
food_path_pd = "/home/labuser/Downloads/food.csv"
orders_details_path_pd = "/home/labuser/Downloads/orders_details.csv"


#### Data Path for PySpark

In [15]:
orders_path = "file:///home/labuser/Downloads/orders.csv"
orders_details_path = "file:///home/labuser/Downloads/orders_details.csv"
customers_path = "file:///home/labuser/Downloads/customers.csv"
food_path = "file:///home/labuser/Downloads/food.csv"
restaurant_path = "file:///home/labuser/Downloads/restaurants.csv"

### Dataframe API

#### FoodWagon needs to retrieve and analyze its data

- Read the FoodWagon Data 

##### --> Python



In [4]:
import pandas as pd

In [5]:
# Read orders and customers DataFrames
orders_df_pd = pd.read_csv(orders_path_pd)
customers_df_pd = pd.read_csv(customers_path_pd)

--> PySpark

##### --> Install PySpark and import it into the Jupyter notebook

In [None]:
%pip install pyspark

In [8]:
%pip install findspark

Collecting findspark
  Downloading findspark-2.0.1-py2.py3-none-any.whl (4.4 kB)
Installing collected packages: findspark
Successfully installed findspark-2.0.1
Note: you may need to restart the kernel to use updated packages.


In [6]:
import findspark
findspark.init()

In [7]:
# Import the SparkSession class from the pyspark.sql module
from pyspark.sql import SparkSession

In [8]:
# Create a SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

2023-11-04 13:22:15,545 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [9]:
# Use the SparkSession to read the CSV file into a DataFrame called orders_df
# - set the file format to "csv"
# - set the "header" option to "true" to indicate that the first row of the CSV file contains column names
# - set the "inferSchema" option to "true" to infer the schema of the DataFrame from the CSV file


orders_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(orders_path)


### Q. Find orders with a restaurant rating (`restaurant_ratings`) greater than 4.5

--> Python

In [10]:
# Filter orders with restaurant_rating greater than 4.5
high_rated_orders = orders_df_pd[orders_df_pd["restaurant_ratings"] > 4.5]

# Display the resulting DataFrame
high_rated_orders

Unnamed: 0,customer_ids_orders,restaurant_ids_orders,restaurant_ratings,order_date,delivery_time,delivery_rating,partner_ids_orders,order_id
0,1000746314,193075,5.0,2019-09-30,96.157046,5.0,1000108610,10000000001
1,1003950373,158013,5.0,2019-03-31,40.521030,5.0,1000161663,10000000002
2,1002362211,170560,5.0,2022-09-04,39.526428,5.0,1000102688,10000000003
3,1004568851,187316,5.0,2019-02-05,86.926029,5.0,1000102818,10000000004
4,1001843660,183296,5.0,2019-06-22,41.853198,5.0,1000027844,10000000005
...,...,...,...,...,...,...,...,...
9999993,1002574387,154295,5.0,2021-03-02,100.892908,5.0,1000141009,10009999994
9999995,1001509362,190550,4.7,2019-04-09,59.505508,5.0,1000126020,10009999996
9999997,1002140393,191333,5.0,2020-06-12,34.390790,5.0,1000147200,10009999998
9999998,1003343496,146043,5.0,2022-05-09,60.699148,5.0,1000147737,10009999999


--> PySpark

In [11]:
# Filter orders with restaurant_rating greater than 4.5
high_rated_orders = orders_df.filter(orders_df["restaurant_ratings"] > 4.5)

# Show the resulting DataFrame
high_rated_orders.show()

+-------------------+---------------------+------------------+----------+------------------+---------------+------------------+-----------+
|customer_ids_orders|restaurant_ids_orders|restaurant_ratings|order_date|     delivery_time|delivery_rating|partner_ids_orders|   order_id|
+-------------------+---------------------+------------------+----------+------------------+---------------+------------------+-----------+
|         1000746314|               193075|               5.0|2019-09-30| 96.15704575574165|            5.0|        1000108610|10000000001|
|         1003950373|               158013|               5.0|2019-03-31| 40.52103039895692|            5.0|        1000161663|10000000002|
|         1002362211|               170560|               5.0|2022-09-04|39.526427646585674|            5.0|        1000102688|10000000003|
|         1004568851|               187316|               5.0|2019-02-05|   86.926029233531|            5.0|        1000102818|10000000004|
|         1001843660

#### Q. FoodWagon is offering a 10% discount on all food items, and the sales team wants to analyze the impact of this discount on food prices.

--> Python

In [12]:
food_df_pd = pd.read_csv(food_path_pd)

In [13]:
# Add a column for discounted price (10% discount)
food_df_pd["discounted_price"] = food_df_pd["food_price"] * 0.9

# Display the food data with original and discounted prices
food_df_pd[["food_menu", "food_name", "food_price", "discounted_price"]]

Unnamed: 0,food_menu,food_name,food_price,discounted_price
0,Burger,Aloo Tikki Burger,40,36.0
1,Burger,Veg Creamy Burger,50,45.0
2,Burger,Cheese Burst Burger,65,58.5
3,Burger,Paneer Creamy Burger,80,72.0
4,Burger,Maxican Burger,80,72.0
...,...,...,...,...
2951734,Fresh Base Pizza,7''cheese & Paneer Pizza,185,166.5
2951735,Fresh Base Pizza,7''fresh Farm Pizza,210,189.0
2951736,Fresh Base Pizza,7''paneer & Onion Pizza,199,179.1
2951737,Fresh Base Pizza,7''veggie Delight Pizza,260,234.0


--> PySpark

In [16]:
food_df = spark.read.format("csv")\
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(food_path)


In [17]:
from pyspark.sql.functions import col

In [18]:
# Add a column for discounted price
discounted_food_data = food_df.withColumn("discounted_price", col("food_price") * 0.9)

# Display the original and discounted prices
result = discounted_food_data.select("food_menu", "food_name", "food_price", "discounted_price")

# Show the resulting DataFrame
result.show()

+-----------------+--------------------+----------+----------------+
|        food_menu|           food_name|food_price|discounted_price|
+-----------------+--------------------+----------+----------------+
|           Burger|   Aloo Tikki Burger|        40|            36.0|
|           Burger|   Veg Creamy Burger|        50|            45.0|
|           Burger| Cheese Burst Burger|        65|            58.5|
|           Burger|Paneer Creamy Burger|        80|            72.0|
|           Burger|      Maxican Burger|        80|            72.0|
|           Burger|  Bbq Chicken Burger|       105|            94.5|
|           Burger|Peri Peri Chicken...|       105|            94.5|
|   Pasta Must Try|         White Sauce|       100|            90.0|
|   Pasta Must Try|           Red Sauce|       100|            90.0|
|   Pasta Must Try|          Pink Sauce|       125|           112.5|
|   Pasta Must Try|        Masala Pasta|       100|            90.0|
|   Pasta Must Try|       Chicken 

### Q. FoodWagon is tidying up its order details records and wants clearer column names. The team has decided to rename the "food_qty" column to "food_quantity" in the order details data.

--> Python

In [21]:
orders_details_df_pd = pd.read_csv(orders_details_path_pd)

In [22]:
# Rename the "food_qty" column to "food_quantity"
orders_details_df_pd.rename(columns={"food_qty": "food_quantity"}, inplace=True)

# Display the updated DataFrame
orders_details_df_pd

Unnamed: 0,order_ids_detail,food_ids_detail,food_quantity,detail_id
0,10002966982,1002945683,2,10000000001
1,10001111117,1000876144,5,10000000002
2,10006755508,1000045611,2,10000000003
3,10001799109,1001827128,2,10000000004
4,10006713375,1001514543,1,10000000005
...,...,...,...,...
9999994,10009825064,1001379137,2,10009999995
9999995,10005529493,1002873454,1,10009999996
9999996,10005981894,1002459514,1,10009999997
9999997,10001137401,1002834502,2,10009999998


--> PySpark

In [23]:
orders_details_df = spark.read.format("csv")\
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(orders_details_path)


In [24]:
# Rename the "food_qty" column to "food_quantity"
renamed_order_details_data = orders_details_df.withColumnRenamed("food_qty", "food_quantity")

# Display the resulting DataFrame
renamed_order_details_data.show()

+----------------+---------------+-------------+-----------+
|order_ids_detail|food_ids_detail|food_quantity|  detail_id|
+----------------+---------------+-------------+-----------+
|     10002966982|     1002945683|            2|10000000001|
|     10001111117|     1000876144|            5|10000000002|
|     10006755508|     1000045611|            2|10000000003|
|     10001799109|     1001827128|            2|10000000004|
|     10006713375|     1001514543|            1|10000000005|
|     10002120033|     1000355682|            1|10000000006|
|     10000579611|     1002709653|            1|10000000007|
|     10004412998|     1001462406|            1|10000000008|
|     10009790293|     1002305816|            1|10000000009|
|     10004582144|     1002904452|            4|10000000010|
|     10001990806|     1001260796|            2|10000000011|
|     10001614375|     1000850066|            2|10000000012|
|     10005846269|     1000283204|            1|10000000013|
|     10005217446|     1

#### Q. FoodWagon wants to drop the license number field (the "licension no" column)

--> Python

In [28]:
restaurants_df_pd = pd.read_csv(restaurant_path_pd)

In [29]:
# Drop the "licension no" column
restaurants_df_pd.drop(columns=["licension no"], inplace=True)

# Display the updated DataFrame
restaurants_df_pd

Unnamed: 0,restaurant_city,restaurant_name,cuisine,restaurant_id
0,Abohar,AB FOODS POINT,Beverages and Pizzas,100001
1,Abohar,Janta Sweet House,Sweets and Bakery,100002
2,Abohar,theka coffee desi,Beverages,100003
3,Abohar,Singh Hut,Fast Food and Indian,100004
4,Abohar,GRILL MASTERS,Italian-American and Fast Food,100005
...,...,...,...,...
120679,Yavatmal,The Food Delight,Fast Food and Snacks,220680
120680,Yavatmal,MAITRI FOODS & BEVERAGES,Pizzas,220681
120681,Yavatmal,Cafe Bella Ciao,Fast Food and Snacks,220682
120682,Yavatmal,GRILL ZILLA,Continental,220683


--> PySpark

In [30]:
restaurants_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(restaurant_path)

In [31]:
# Drop the "licension no" column
restaurant_data = restaurants_df.drop("licension no")

# Display the resulting DataFrame
restaurant_data.show()


+---------------+--------------------+--------------------+-------------+
|restaurant_city|     restaurant_name|             cuisine|restaurant_id|
+---------------+--------------------+--------------------+-------------+
|         Abohar|      AB FOODS POINT|Beverages and Pizzas|       100001|
|         Abohar|   Janta Sweet House|   Sweets and Bakery|       100002|
|         Abohar|   theka coffee desi|           Beverages|       100003|
|         Abohar|           Singh Hut|Fast Food and Indian|       100004|
|         Abohar|       GRILL MASTERS|Italian-American ...|       100005|
|         Abohar|           Sam Uncle|         Continental|       100006|
|         Abohar|    shere punjab veg|        North Indian|       100007|
|         Abohar|Shri Balaji Vaish...|        North Indian|       100008|
|         Abohar|Hinglaj Kachori B...|    Snacks and Chaat|       100009|
|         Abohar|           yummy hub|              Indian|       100010|
|         Abohar|CHAWLA SAAB THE J...|

#### Q. To optimize its menu, FoodWagon is reviewing the food menu and needs the distinct food_menu names.

--> Python

In [32]:
# Get unique food_menu values
unique_food_menus = food_df_pd["food_menu"].unique()

# Display the unique food_menu values
unique_food_menus

array(['Burger', 'Pasta Must Try', 'Chiness Appetizer', ...,
       'Tandoori Roti and Naan and Paratha', 'Flaoters',
       'Bella Ciao Special'], dtype=object)

--> PySpark

In [33]:
# Get unique food_menu values
unique_food_menus = food_df.select("food_menu").distinct()

# Show the unique food_menu values
unique_food_menus.show(truncate=False)

+------------------------------+
|food_menu                     |
+------------------------------+
|Kidzee                        |
|Fried Rice                    |
|Fresh Fruit and Dry Fruit     |
|All Day Feast                 |
|Special Stuffing Dosa         |
|Pickles                       |
|Hyderabadi Biryani            |
|Scratch Speciality            |
|Platers                       |
|Starter For Vegetarian        |
|Italian Gelato (Low Fat)      |
|Deserts & Drinks              |
|Continental & Desserts        |
|Flavours Curated by ITC Mughal|
|Festive Box [Savings upto 15%]|
|Special Veg Fried Items       |
|Burger Singh                  |
|Apperitzers / Starter         |
|Veg Breakfast                 |
|Hot Bevrage                   |
+------------------------------+
only showing top 20 rows
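
The two outputs above start with different menus because pandas `unique()` keeps values in order of first appearance, whereas Spark's `distinct()` gives no ordering guarantee. As a small sketch on made-up sample rows (the data below is illustrative, not taken from `food.csv`), sorting the distinct values gives a stable list that is easier to compare; on the Spark side an `orderBy("food_menu")` after `distinct()` plays the same role.

```python
import pandas as pd

# Made-up sample with repeated menu names
food_sample = pd.DataFrame({
    "food_menu": ["Burger", "Pasta Must Try", "Burger", "Fresh Base Pizza", "Pasta Must Try"],
})

# unique() keeps the order in which values first appear
unique_in_order = food_sample["food_menu"].unique().tolist()

# Sorting makes the list independent of row order
unique_sorted = sorted(unique_in_order)

# nunique() returns only the number of distinct values
distinct_count = food_sample["food_menu"].nunique()

print(unique_in_order)
print(unique_sorted)
print(distinct_count)
```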




#### Q. Calculate the average delivery rating for each restaurant.

--> Python

In [34]:
# Group orders by restaurant and calculate the average delivery rating
average_delivery_ratings = orders_df_pd.groupby("restaurant_ids_orders")["delivery_rating"].mean()

# Display the average delivery ratings
average_delivery_ratings

restaurant_ids_orders
100001    5.0
100002    5.0
100003    5.0
100004    5.0
100005    5.0
         ... 
220680    5.0
220681    5.0
220682    5.0
220683    5.0
220684    5.0
Name: delivery_rating, Length: 120684, dtype: float64

--> PySpark

In [35]:
# Group orders by restaurant and calculate the average delivery rating
average_delivery_ratings = orders_df.groupBy("restaurant_ids_orders").agg({"delivery_rating": "avg"})

# Show the average delivery ratings
average_delivery_ratings.show()

+---------------------+--------------------+
|restaurant_ids_orders|avg(delivery_rating)|
+---------------------+--------------------+
|               161234|                 5.0|
|               197148|                 5.0|
|               172959|   4.994871794871795|
|               117994|  4.9975000000000005|
|               185705|                 5.0|
|               185506|   4.997402597402598|
|               156749|   4.985714285714286|
|               160767|                 5.0|
|               204529|                 5.0|
|               149177|                 5.0|
|               165829|   4.993478260869565|
|               131931|                 5.0|
|               160820|                 5.0|
|               154033|                 5.0|
|               180753|                 5.0|
|               216014|   4.992307692307692|
|               147961|                 5.0|
|               146224|                 5.0|
|               206378|                 5.0|
|         
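
The pandas result above is a Series indexed and sorted by `restaurant_ids_orders`, while the PySpark result is a DataFrame with an `avg(delivery_rating)` column in no particular order. As a small sketch on made-up sample rows (the values and the `avg_delivery_rating` column name are illustrative assumptions), the pandas side can be flattened into a two-column table that lines up more directly with the Spark output:

```python
import pandas as pd

# Made-up sample: five orders across three restaurants
orders_sample = pd.DataFrame({
    "restaurant_ids_orders": [100002, 100001, 100002, 100001, 100003],
    "delivery_rating": [5.0, 4.0, 4.0, 5.0, 5.0],
})

# as_index=False keeps the grouping key as a regular column
avg_ratings = (
    orders_sample.groupby("restaurant_ids_orders", as_index=False)["delivery_rating"]
    .mean()
    .rename(columns={"delivery_rating": "avg_delivery_rating"})
)

print(avg_ratings)
```

In PySpark, a similar named column can be obtained with `F.avg("delivery_rating").alias("avg_delivery_rating")` inside `agg`, where `F` is `pyspark.sql.functions`.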



#### Q. FoodWagon wants to analyze the average delivery rating of orders for each cuisine type.

--> Python

In [36]:
# Perform the join operation
joined_data = pd.merge(orders_df_pd, restaurants_df_pd, left_on="restaurant_ids_orders", right_on="restaurant_id")

# Calculate the average delivery rating for each cuisine type
avg_delivery_by_cuisine = joined_data.groupby("cuisine")["delivery_rating"].mean()

# Display the resulting report
print(avg_delivery_by_cuisine)

cuisine
8:15 To 11:30 Pm                                                   4.999695
Afghani                                                            4.999738
Afghani and  Biryani and  Chinese and  Indian and  North Indian    5.000000
Afghani and  Fast Food                                             5.000000
Afghani and  North Indian and  Indian and  Beverages               5.000000
                                                                     ...   
Waffle and Chinese                                                 5.000000
Waffle and Desserts                                                4.998671
Waffle and Fast Food                                               5.000000
Waffle and Ice Cream                                               4.996912
Waffle and Snacks                                                  4.997664
Name: delivery_rating, Length: 5428, dtype: float64


--> PySpark

In [37]:
# Perform the join operation
joined_data = orders_df.join(restaurant_data, orders_df["restaurant_ids_orders"] == restaurant_data["restaurant_id"])

# Calculate the average delivery rating for each cuisine type
avg_delivery_by_cuisine = joined_data.groupBy("cuisine").agg({"delivery_rating": "avg"})

# Show the resulting report
avg_delivery_by_cuisine.show()

+--------------------+--------------------+
|             cuisine|avg(delivery_rating)|
+--------------------+--------------------+
| Arabian and Chinese|  4.9979650032404415|
|Fast Food and Indian|   4.998508146306353|
|South Indian and ...|   4.998376982663224|
|    Pizzas and Combo|   4.999290420679168|
|North Indian and ...|  4.9990476190476185|
|Arabian and  Indi...|   4.999408284023668|
|Chinese and  Tand...|                 5.0|
|Fast Food and Bur...|   4.999497319034852|
|Lebanese and Fast...|   4.998317417007732|
|American and Stre...|   4.994495412844037|
| Ice Cream and Chaat|    4.99980732177264|
|Beverages and  In...|   4.998677248677248|
|Snacks and  Pizza...|                 5.0|
|Snacks and  Desserts|                 5.0|
|South Indian and ...|   4.999236641221374|
|     Indian and Cafe|                 5.0|
|Thalis and  Rajas...|   4.995945945945945|
|   Snacks and Haleem|                 5.0|
|Ice Cream and  It...|                 5.0|
|Italian-American ...|          
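
Both joins above are inner joins by default, so orders whose restaurant ID has no match in the restaurant data are silently dropped. The sketch below shows this on a few made-up rows (the order IDs, ratings, and the unmatched ID `999999` are hypothetical). It also passes `validate="many_to_one"`, an optional pandas check that raises an error if `restaurant_id` is not unique on the right-hand side.

```python
import pandas as pd

# Made-up sample: the last order points at a restaurant that does not exist
orders_sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "restaurant_ids_orders": [100001, 100001, 100004, 999999],
    "delivery_rating": [5.0, 4.0, 5.0, 3.0],
})
restaurants_sample = pd.DataFrame({
    "restaurant_id": [100001, 100004],
    "cuisine": ["Beverages and Pizzas", "Fast Food and Indian"],
})

# Inner join: only orders with a matching restaurant survive
joined_sample = pd.merge(
    orders_sample,
    restaurants_sample,
    how="inner",
    left_on="restaurant_ids_orders",
    right_on="restaurant_id",
    validate="many_to_one",
)

# Average delivery rating per cuisine on the joined rows
avg_by_cuisine = joined_sample.groupby("cuisine")["delivery_rating"].mean()

print(len(joined_sample))
print(avg_by_cuisine)
```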



### Spark SQL API 

- The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.

### FoodWagon wants to run SQL queries on top of the data.
- Create a DataFrame
- Create a temporary view on top of this DataFrame
- Run a SQL query against the view

In [None]:
restaurants_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(restaurant_path)

##### In Spark, the createOrReplaceTempView method is used to create a temporary view from a DataFrame. A temporary view is a named table-like structure that allows you to execute SQL queries on DataFrame data using Spark's SQL API.

In [None]:
restaurants_df.createOrReplaceTempView("restaurant")

--> Let's look at all records in the restaurant view

In [None]:
result_df = spark.sql("SELECT * FROM restaurant")
result_df.show(truncate=False)

#### FoodWagon wants to identify the partner restaurants that serve a particular cuisine, here 'Beverages and Pizzas'.

In [None]:
filtered_df = spark.sql(""" SELECT * FROM restaurant where cuisine = 'Beverages and Pizzas' """)
filtered_df.show()

#### FoodWagon wants to list the cities in the restaurant data and count the restaurants in each

In [None]:
aggregated_df = spark.sql("""
    SELECT restaurant_city, COUNT(*) AS city_count
    FROM restaurant
    GROUP BY restaurant_city
""")
aggregated_df.show()
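
To keep the pandas and PySpark comparison going, here is a rough sketch of how the two SQL queries above might look in pandas. The four sample rows are made up for illustration (names borrowed from the outputs earlier in the notebook), and `restaurants_sample` is a hypothetical stand-in for the full restaurant data.

```python
import pandas as pd

# Made-up sample of the restaurant data
restaurants_sample = pd.DataFrame({
    "restaurant_city": ["Abohar", "Abohar", "Abohar", "Yavatmal"],
    "restaurant_name": ["AB FOODS POINT", "Janta Sweet House", "Singh Hut", "GRILL ZILLA"],
    "cuisine": ["Beverages and Pizzas", "Sweets and Bakery", "Fast Food and Indian", "Continental"],
})

# SELECT * FROM restaurant WHERE cuisine = 'Beverages and Pizzas'
filtered = restaurants_sample[restaurants_sample["cuisine"] == "Beverages and Pizzas"]

# SELECT restaurant_city, COUNT(*) AS city_count FROM restaurant GROUP BY restaurant_city
city_counts = (
    restaurants_sample.groupby("restaurant_city")
    .size()
    .reset_index(name="city_count")
)

print(filtered)
print(city_counts)
```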

- Creating a temporary view is how FoodWagon can run SQL queries on a DataFrame. Temporary views in Spark SQL are session-scoped: a view disappears when the session that created it terminates.
- If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a **Global Temporary View**.

In [None]:
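# Register the DataFrame as a global temporary view
# (createOrReplaceGlobalTempView lets this cell be re-run without an "already exists" error)
restaurants_df.createOrReplaceGlobalTempView("restaurant_glob")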

In [None]:
# A global temporary view is tied to the system-preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.restaurant_glob").show()