**Master DataFrame Assignment** – Retail Sales Superstore Dataset

Step 1: Sample Dataset (create CSV)

In [18]:
import pandas as pd

data = """OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
CA-1001,2023-01-15,Ravi,Consumer,South,Laptop,Technology,Computers,1,55000,0.10,5000
CA-1002,2023-02-20,Priya,Corporate,North,Printer,Technology,Peripherals,2,12000,0.15,1800
CA-1003,2023-01-25,Amit,Consumer,East,Notebook,Office Supplies,Paper,3,200,0.05,150
CA-1004,2023-03-01,Anita,Home Office,West,Table,Furniture,Tables,1,18000,0.20,-1500
CA-1005,2023-02-05,Divya,Consumer,South,Phone,Technology,Phones,2,20000,0.00,3000
"""

with open('superstore.csv', 'w') as file:
    file.write(data)

**PART 1: Pandas DataFrame Operations**

1. Load the CSV using pandas.

In [19]:
df = pd.read_csv('superstore.csv')
print(df)

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   

          Category  SubCategory  Quantity  UnitPrice  Discount  Profit  
0       Technology    Computers         1      55000      0.10    5000  
1       Technology  Peripherals         2      12000      0.15    1800  
2  Office Supplies        Paper         3        200      0.05     150  
3        Furniture       Tables         1      18000      0.20   -1500  
4       Technology       Phones         2      20000      0.00    3000  


2. Print schema, head, shape, dtypes.

In [20]:
print(df.head())

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   

          Category  SubCategory  Quantity  UnitPrice  Discount  Profit  
0       Technology    Computers         1      55000      0.10    5000  
1       Technology  Peripherals         2      12000      0.15    1800  
2  Office Supplies        Paper         3        200      0.05     150  
3        Furniture       Tables         1      18000      0.20   -1500  
4       Technology       Phones         2      20000      0.00    3000  


In [21]:
print(df.shape)

(5, 12)


In [22]:
print(df.dtypes)

OrderID         object
OrderDate       object
Customer        object
Segment         object
Region          object
Product         object
Category        object
SubCategory     object
Quantity         int64
UnitPrice        int64
Discount       float64
Profit           int64
dtype: object
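
The task also asks for the schema, which the cells above do not print directly. As a minimal sketch, `DataFrame.info()` reports column names, non-null counts, and dtypes in one call. The small frame below is an illustrative stand-in for the superstore data, not the full dataset.

```python
import io

import pandas as pd

# Illustrative subset of the superstore columns (assumption: stand-in data).
df = pd.DataFrame({
    "OrderID": ["CA-1001", "CA-1002"],
    "Quantity": [1, 2],
    "Discount": [0.10, 0.15],
})

# info() writes to stdout by default; passing buf captures the text instead.
buf = io.StringIO()
df.info(buf=buf)
schema_text = buf.getvalue()
print(schema_text)
```

In a notebook, calling `df.info()` with no arguments is usually enough; the buffer is only used here so the text can be inspected.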


3. Select the Customer, Product, and Profit columns.

In [23]:
value = df[['Customer', 'Product', 'Profit']]
print(value)

  Customer   Product  Profit
0     Ravi    Laptop    5000
1    Priya   Printer    1800
2     Amit  Notebook     150
3    Anita     Table   -1500
4    Divya     Phone    3000


4. Filter orders where Profit > 2000 and Discount = 0.

In [29]:
values = df[(df['Profit'] > 2000) & (df['Discount'] == 0)]
print(values)

   OrderID   OrderDate Customer   Segment Region Product    Category  \
4  CA-1005  2023-02-05    Divya  Consumer  South   Phone  Technology   

  SubCategory  Quantity  UnitPrice  Discount  Profit  
4      Phones         2      20000       0.0    3000  
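
The same filter can also be written with `DataFrame.query`, which some readers find easier to scan than chained boolean masks. This is a sketch on a three-column stand-in for the dataset; the name `high_no_discount` is only illustrative.

```python
import pandas as pd

# Stand-in frame holding only the columns the filter needs.
df = pd.DataFrame({
    "OrderID": ["CA-1001", "CA-1002", "CA-1003", "CA-1004", "CA-1005"],
    "Discount": [0.10, 0.15, 0.05, 0.20, 0.00],
    "Profit": [5000, 1800, 150, -1500, 3000],
})

# query() evaluates the expression against column names.
high_no_discount = df.query("Profit > 2000 and Discount == 0")
print(high_no_discount)
```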


5. Sort by Profit descending.

In [30]:
sort = df.sort_values(by='Profit', ascending=False)
print(sort)

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   

          Category  SubCategory  Quantity  UnitPrice  Discount  Profit  
0       Technology    Computers         1      55000      0.10    5000  
4       Technology       Phones         2      20000      0.00    3000  
1       Technology  Peripherals         2      12000      0.15    1800  
2  Office Supplies        Paper         3        200      0.05     150  
3        Furniture       Tables         1      18000      0.20   -1500  


6. GroupBy Category → Total Profit, Avg Discount.

In [33]:
grouped = df.groupby('Category').agg({
    'Profit': 'sum',
    'Discount': 'mean'
})
print(grouped)

                 Profit  Discount
Category                         
Furniture         -1500  0.200000
Office Supplies     150  0.050000
Technology         9800  0.083333
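
The result above keeps the original column names, so "Profit" really means total profit and "Discount" means average discount. One possible way to make that explicit is named aggregation, sketched here on a stand-in frame; the output names `TotalProfit` and `AvgDiscount` are assumptions chosen for illustration.

```python
import pandas as pd

# Stand-in frame with the three columns used by the aggregation.
df = pd.DataFrame({
    "Category": ["Technology", "Technology", "Office Supplies",
                 "Furniture", "Technology"],
    "Discount": [0.10, 0.15, 0.05, 0.20, 0.00],
    "Profit": [5000, 1800, 150, -1500, 3000],
})

# Named aggregation: output_name=(source_column, aggregation).
summary = df.groupby("Category").agg(
    TotalProfit=("Profit", "sum"),
    AvgDiscount=("Discount", "mean"),
)
print(summary)
```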


7. Add a column TotalPrice = Quantity * UnitPrice.

In [34]:
tot = df['Quantity'] * df['UnitPrice']
df['TotalPrice'] = tot
print(df)

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   

          Category  SubCategory  Quantity  UnitPrice  Discount  Profit  \
0       Technology    Computers         1      55000      0.10    5000   
1       Technology  Peripherals         2      12000      0.15    1800   
2  Office Supplies        Paper         3        200      0.05     150   
3        Furniture       Tables         1      18000      0.20   -1500   
4       Technology       Phones         2      20000      0.00    3000   

   TotalPrice  
0       55000  
1       24000  
2         600  
3       18000  
4       40000  


8. Drop the SubCategory column.

In [35]:
df = df.drop(columns='SubCategory')
print(df)

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   

          Category  Quantity  UnitPrice  Discount  Profit  TotalPrice  
0       Technology         1      55000      0.10    5000       55000  
1       Technology         2      12000      0.15    1800       24000  
2  Office Supplies         3        200      0.05     150         600  
3        Furniture         1      18000      0.20   -1500       18000  
4       Technology         2      20000      0.00    3000       40000  


9. Fill nulls in Discount with 0.10.

In [38]:
df['Discount'] = df['Discount'].fillna(0.10)
print(df)

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   

          Category  Quantity  UnitPrice  Discount  Profit  TotalPrice  
0       Technology         1      55000      0.10    5000       55000  
1       Technology         2      12000      0.15    1800       24000  
2  Office Supplies         3        200      0.05     150         600  
3        Furniture         1      18000      0.20   -1500       18000  
4       Technology         2      20000      0.00    3000       40000  
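
The sample data has no missing discounts, so the output above is unchanged. To see `fillna` take effect, here is a small sketch with one deliberately missing value; the frame `demo` is made up for illustration and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Discount to show the fill in action.
demo = pd.DataFrame({
    "OrderID": ["CA-1001", "CA-1002", "CA-1003"],
    "Discount": [0.10, np.nan, 0.05],
})

missing_before = int(demo["Discount"].isna().sum())
demo["Discount"] = demo["Discount"].fillna(0.10)
missing_after = int(demo["Discount"].isna().sum())

print(missing_before, missing_after)
print(demo)
```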


10. Apply a function to categorize orders:

In [39]:
def classify(row):
    if row['Profit'] > 4000:
        return 'High'
    elif row['Profit'] > 0:
        return 'Medium'
    else:
        return 'Low'

df['OrderCategory'] = df.apply(classify, axis=1)
print(df)

   OrderID   OrderDate Customer      Segment Region   Product  \
0  CA-1001  2023-01-15     Ravi     Consumer  South    Laptop   
1  CA-1002  2023-02-20    Priya    Corporate  North   Printer   
2  CA-1003  2023-01-25     Amit     Consumer   East  Notebook   
3  CA-1004  2023-03-01    Anita  Home Office   West     Table   
4  CA-1005  2023-02-05    Divya     Consumer  South     Phone   

          Category  Quantity  UnitPrice  Discount  Profit  TotalPrice  \
0       Technology         1      55000      0.10    5000       55000   
1       Technology         2      12000      0.15    1800       24000   
2  Office Supplies         3        200      0.05     150         600   
3        Furniture         1      18000      0.20   -1500       18000   
4       Technology         2      20000      0.00    3000       40000   

  OrderCategory  
0          High  
1        Medium  
2        Medium  
3           Low  
4        Medium  
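
Row-wise `apply` is fine for five rows, but it calls the Python function once per row. A vectorized alternative, sketched below with `numpy.select`, applies the same thresholds as `classify` to the whole column at once. Conditions are checked in order, so the first match wins, as in the `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Profit values from the sample dataset.
profit = pd.Series([5000, 1800, 150, -1500, 3000], name="Profit")

# Same thresholds as classify(): > 4000 is High, > 0 is Medium, else Low.
conditions = [profit > 4000, profit > 0]
labels = ["High", "Medium"]

order_category = pd.Series(
    np.select(conditions, labels, default="Low"),
    index=profit.index,
    name="OrderCategory",
)
print(order_category)
```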


**PART 2: PySpark DataFrame Operations**

1. Load the same CSV using PySpark.

In [48]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RetailSalesSuperstore").getOrCreate()
spark

In [43]:
csv_data = """OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
CA-1001,2023-01-15,Ravi,Consumer,South,Laptop,Technology,Computers,1,55000,0.10,5000
CA-1002,2023-02-20,Priya,Corporate,North,Printer,Technology,Peripherals,2,12000,0.15,1800
CA-1003,2023-01-25,Amit,Consumer,East,Notebook,Office Supplies,Paper,3,200,0.05,150
CA-1004,2023-03-01,Anita,Home Office,West,Table,Furniture,Tables,1,18000,0.20,-1500
CA-1005,2023-02-05,Divya,Consumer,South,Phone,Technology,Phones,2,20000,0.00,3000
"""

with open("superstore.csv", "w") as f:
    f.write(csv_data)

df = spark.read.csv("superstore.csv", header=True, inferSchema=True)
df.show()

+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate|Customer|    Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+
|CA-1001|2023-01-15|    Ravi|   Consumer| South|  Laptop|     Technology|  Computers|       1|    55000|     0.1|  5000|
|CA-1002|2023-02-20|   Priya|  Corporate| North| Printer|     Technology|Peripherals|       2|    12000|    0.15|  1800|
|CA-1003|2023-01-25|    Amit|   Consumer|  East|Notebook|Office Supplies|      Paper|       3|      200|    0.05|   150|
|CA-1004|2023-03-01|   Anita|Home Office|  West|   Table|      Furniture|     Tables|       1|    18000|     0.2| -1500|
|CA-1005|2023-02-05|   Divya|   Consumer| South|   Phone|     Technology|     Phones|       2|    20000|     0.0|  3000|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+

2. Show schema and first 5 rows.

In [45]:
df.printSchema()
df.show(5)

root
 |-- OrderID: string (nullable = true)
 |-- OrderDate: date (nullable = true)
 |-- Customer: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- SubCategory: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- UnitPrice: integer (nullable = true)
 |-- Discount: double (nullable = true)
 |-- Profit: integer (nullable = true)

+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate|Customer|    Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+
|CA-1001|2023-01-15|    Ravi|   Consumer| South|  Laptop|     Technology|  Computers|       1|    55000|     0.1|  5000|
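|CA-1002|2023-02-20|   Priya|  Corporate| North| Printer|     Technology|Peripherals|       2|    12000|    0.15|  1800|
|CA-1003|2023-01-25|    Amit|   Consumer|  East|Notebook|Office Supplies|      Paper|       3|      200|    0.05|   150|
|CA-1004|2023-03-01|   Anita|Home Office|  West|   Table|      Furniture|     Tables|       1|    18000|     0.2| -1500|
|CA-1005|2023-02-05|   Divya|   Consumer| South|   Phone|     Technology|     Phones|       2|    20000|     0.0|  3000|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+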

3. Select columns and rename Customer → Client.

In [50]:
from pyspark.sql.functions import col

renamed = df.select(col("Customer").alias("Client"))
renamed.show()

+------+
|Client|
+------+
|  Ravi|
| Priya|
|  Amit|
| Anita|
| Divya|
+------+



4. Filter Segment = 'Consumer' and Profit < 1000.

In [53]:
filtered = df.filter((col('Segment') == 'Consumer') & (col('Profit') < 1000))
filtered.show()

+-------+----------+--------+--------+------+--------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate|Customer| Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+--------+--------+------+--------+---------------+-----------+--------+---------+--------+------+
|CA-1003|2023-01-25|    Amit|Consumer|  East|Notebook|Office Supplies|      Paper|       3|      200|    0.05|   150|
+-------+----------+--------+--------+------+--------+---------------+-----------+--------+---------+--------+------+



5. GroupBy Region and show average profit.

In [55]:
grouped_region = df.groupBy('Region').agg({'Profit': 'mean'})
grouped_region.show()

+------+-----------+
|Region|avg(Profit)|
+------+-----------+
| South|     4000.0|
|  East|      150.0|
|  West|    -1500.0|
| North|     1800.0|
+------+-----------+



6. Use withColumn to create TotalPrice = Quantity * UnitPrice.

In [56]:
total = df.withColumn('TotalPrice', col('Quantity') * col('UnitPrice'))
total.show()

+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+----------+
|OrderID| OrderDate|Customer|    Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|TotalPrice|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+----------+
|CA-1001|2023-01-15|    Ravi|   Consumer| South|  Laptop|     Technology|  Computers|       1|    55000|     0.1|  5000|     55000|
|CA-1002|2023-02-20|   Priya|  Corporate| North| Printer|     Technology|Peripherals|       2|    12000|    0.15|  1800|     24000|
|CA-1003|2023-01-25|    Amit|   Consumer|  East|Notebook|Office Supplies|      Paper|       3|      200|    0.05|   150|       600|
|CA-1004|2023-03-01|   Anita|Home Office|  West|   Table|      Furniture|     Tables|       1|    18000|     0.2| -1500|     18000|
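|CA-1005|2023-02-05|   Divya|   Consumer| South|   Phone|     Technology|     Phones|       2|    20000|     0.0|  3000|     40000|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+----------+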

7. Use when().otherwise() to classify Profit as:
'Profit' > 2000 → 'High'
'Profit' <= 0 → 'Loss'
else 'Medium'

In [57]:
from pyspark.sql.functions import when

profits = df.withColumn(
    'ProfitCategory',
    when(col('Profit') > 2000, 'High')
    .when(col('Profit') <= 0, 'Loss')
    .otherwise('Medium')
)
profits.show()

+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+--------------+
|OrderID| OrderDate|Customer|    Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|ProfitCategory|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+--------------+
|CA-1001|2023-01-15|    Ravi|   Consumer| South|  Laptop|     Technology|  Computers|       1|    55000|     0.1|  5000|          High|
|CA-1002|2023-02-20|   Priya|  Corporate| North| Printer|     Technology|Peripherals|       2|    12000|    0.15|  1800|        Medium|
|CA-1003|2023-01-25|    Amit|   Consumer|  East|Notebook|Office Supplies|      Paper|       3|      200|    0.05|   150|        Medium|
|CA-1004|2023-03-01|   Anita|Home Office|  West|   Table|      Furniture|     Tables|       1|    18000|     0.2| -1500|          Loss|
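|CA-1005|2023-02-05|   Divya|   Consumer| South|   Phone|     Technology|     Phones|       2|    20000|     0.0|  3000|          High|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+--------------+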

8. Use drop() to remove SubCategory.

In [58]:
dropped = df.drop('SubCategory')
dropped.show()

+-------+----------+--------+-----------+------+--------+---------------+--------+---------+--------+------+
|OrderID| OrderDate|Customer|    Segment|Region| Product|       Category|Quantity|UnitPrice|Discount|Profit|
+-------+----------+--------+-----------+------+--------+---------------+--------+---------+--------+------+
|CA-1001|2023-01-15|    Ravi|   Consumer| South|  Laptop|     Technology|       1|    55000|     0.1|  5000|
|CA-1002|2023-02-20|   Priya|  Corporate| North| Printer|     Technology|       2|    12000|    0.15|  1800|
|CA-1003|2023-01-25|    Amit|   Consumer|  East|Notebook|Office Supplies|       3|      200|    0.05|   150|
|CA-1004|2023-03-01|   Anita|Home Office|  West|   Table|      Furniture|       1|    18000|     0.2| -1500|
|CA-1005|2023-02-05|   Divya|   Consumer| South|   Phone|     Technology|       2|    20000|     0.0|  3000|
+-------+----------+--------+-----------+------+--------+---------------+--------+---------+--------+------+



9. Handle nulls in Discount using fillna(0.10).

In [61]:
df = df.fillna({"Discount": 0.10})
df.show()

+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate|Customer|    Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+
|CA-1001|2023-01-15|    Ravi|   Consumer| South|  Laptop|     Technology|  Computers|       1|    55000|     0.1|  5000|
|CA-1002|2023-02-20|   Priya|  Corporate| North| Printer|     Technology|Peripherals|       2|    12000|    0.15|  1800|
|CA-1003|2023-01-25|    Amit|   Consumer|  East|Notebook|Office Supplies|      Paper|       3|      200|    0.05|   150|
|CA-1004|2023-03-01|   Anita|Home Office|  West|   Table|      Furniture|     Tables|       1|    18000|     0.2| -1500|
|CA-1005|2023-02-05|   Divya|   Consumer| South|   Phone|     Technology|     Phones|       2|    20000|     0.0|  3000|
+-------+----------+--------+-----------+------+--------+---------------+-----------+--------+---------+--------+------+

10. Convert OrderDate to date type and extract year, month.

In [64]:
from pyspark.sql.functions import to_date, year, month

df = df.withColumn("OrderDate", to_date(col("OrderDate"), "yyyy-MM-dd"))
df = df.withColumn("OrderYear", year("OrderDate"))
df = df.withColumn("OrderMonth", month("OrderDate"))
df.select("OrderID", "OrderDate", "OrderYear", "OrderMonth").show()

+-------+----------+---------+----------+
|OrderID| OrderDate|OrderYear|OrderMonth|
+-------+----------+---------+----------+
|CA-1001|2023-01-15|     2023|         1|
|CA-1002|2023-02-20|     2023|         2|
|CA-1003|2023-01-25|     2023|         1|
|CA-1004|2023-03-01|     2023|         3|
|CA-1005|2023-02-05|     2023|         2|
+-------+----------+---------+----------+



**PART 3: Dask DataFrame Operations (Pandas Alternative)**

1. Install Dask:

In [65]:
!pip install dask



2. Load the same superstore.csv:

In [70]:
import dask.dataframe as dd
df = dd.read_csv('superstore.csv')

3. Do the following:
Compute average discount by category.
Filter orders with Quantity greater than 1 and high profit (here, Profit > 2000).
Save filtered data to new CSV.

In [72]:
avg_discount = df.groupby('Category')['Discount'].mean().compute()
print(avg_discount)

filtered = df[(df['Quantity'] > 1) & (df['Profit'] > 2000)]

# With single_file=True the '*' is not expanded, so use a plain file name.
filtered.to_csv('filtered_orders.csv', single_file=True)

Category
Furniture          0.200000
Office Supplies    0.050000
Technology         0.083333
Name: Discount, dtype: float64


['/content/filtered_orders.csv']

**PART 4: JSON Handling (Complex Nested)**

1. Create a nested JSON file:

In [84]:
import json

orders = [
    {
        "OrderID": "CA-1001",
        "Customer": {"Name": "Ravi", "Segment": "Consumer"},
        "Details": {"Region": "South", "Profit": 5000}
    },
    {
        "OrderID": "CA-1002",
        "Customer": {"Name": "Priya", "Segment": "Corporate"},
        "Details": {"Region": "North", "Profit": 1800}
    }
]

with open('orders.json', 'w') as f:
    json.dump(orders, f, indent=4)

print("orders.json created successfully.")

orders.json created successfully.


2. Load it using PySpark:

In [85]:
spark = SparkSession.builder.appName("JSONHandling").getOrCreate()

df_json = spark.read.json('orders.json', multiLine=True)
df_json.printSchema()

root
 |-- Customer: struct (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- Segment: string (nullable = true)
 |-- Details: struct (nullable = true)
 |    |-- Profit: long (nullable = true)
 |    |-- Region: string (nullable = true)
 |-- OrderID: string (nullable = true)



In [86]:
df_json.select("OrderID", "Customer.Name", "Details.Profit").show()

+-------+-----+------+
|OrderID| Name|Profit|
+-------+-----+------+
|CA-1001| Ravi|  5000|
|CA-1002|Priya|  1800|
+-------+-----+------+
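
For comparison with the PySpark dot-path selection above, the same nested records can be flattened in pandas. This is a sketch using `pandas.json_normalize`, which joins nested keys with a dot by default; it works on the in-memory `orders` list rather than the file, and the name `flat` is only illustrative.

```python
import pandas as pd

# Same nested records that were written to orders.json.
orders = [
    {
        "OrderID": "CA-1001",
        "Customer": {"Name": "Ravi", "Segment": "Consumer"},
        "Details": {"Region": "South", "Profit": 5000},
    },
    {
        "OrderID": "CA-1002",
        "Customer": {"Name": "Priya", "Segment": "Corporate"},
        "Details": {"Region": "North", "Profit": 1800},
    },
]

# Nested dicts become columns such as "Customer.Name" and "Details.Profit".
flat = pd.json_normalize(orders)
print(flat[["OrderID", "Customer.Name", "Details.Profit"]])
```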

