Master DataFrame Assignment – Retail Sales Superstore
Dataset (All-In-One)

We'll use a well-structured retail dataset like the Superstore Sales dataset (mini
version provided below).

Step 1: Sample Dataset (create CSV)

Save as superstore.csv:

In [127]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [128]:
import pandas as pd
import numpy as np

In [129]:
csv_data="""OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
CA-1001,2023-01-15,Ravi,Consumer,South,Laptop,Technology,Computers,1,55000,0.10,5000
CA-1002,2023-02-
20,Priya,Corporate,North,Printer,Technology,Peripherals,2,12000,0.15,1800
CA-1003,2023-01-25,Amit,Consumer,East,Notebook,Office Supplies,Paper,3,200,0.05,150
CA-1004,2023-03-01,Anita,Home Office,West,Table,Furniture,Tables,1,18000,0.20,-1500
CA-1005,2023-02-05,Divya,Consumer,South,Phone,Technology,Phones,2,20000,0.00,3000"""


with open("/content/drive/MyDrive/Colab Notebooks/superstore.csv","w") as file:
  file.write(csv_data)
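
The CA-1002 record in csv_data above is split across two lines ("CA-1002,2023-02-" and then "20,Priya,..."), so the file ends up with two short rows instead of one complete record. A minimal sketch of a field-count check that catches this before loading, using only the standard library; the name `sample` and the shortened data are assumptions for illustration:

```python
import csv
import io

# Header plus the first three records, copied from csv_data above
# (the line break inside the CA-1002 record is kept on purpose).
sample = """OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
CA-1001,2023-01-15,Ravi,Consumer,South,Laptop,Technology,Computers,1,55000,0.10,5000
CA-1002,2023-02-
20,Priya,Corporate,North,Printer,Technology,Peripherals,2,12000,0.15,1800
CA-1003,2023-01-25,Amit,Consumer,East,Notebook,Office Supplies,Paper,3,200,0.05,150"""

rows = list(csv.reader(io.StringIO(sample)))
header, records = rows[0], rows[1:]

# Collect (data line number, field count) for every line whose
# field count differs from the header's 12 columns.
bad = [(i, len(r)) for i, r in enumerate(records, start=1) if len(r) != len(header)]
print(bad)  # [(2, 2), (3, 11)]
```

Data lines 2 and 3 are the two halves of the same order, which is why every loader later in the notebook shows one mostly empty row followed by one shifted row.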

TASKS ACROSS Pandas, PySpark, and Dask

PART 1: Pandas DataFrame Operations

1. Load the CSV using pandas.

In [130]:
df=pd.read_csv("/content/drive/MyDrive/Colab Notebooks/superstore.csv")
print(df)

   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate        North  Printer  Technology   
3  CA-1003  2023-01-25       Amit     Consumer     East    Notebook   
4  CA-1004  2023-03-01      Anita  Home Office     West       Table   
5  CA-1005  2023-02-05      Divya     Consumer    South       Phone   

          Category SubCategory  Quantity  UnitPrice  Discount  Profit  
0       Technology   Computers       1.0   55000.00      0.10  5000.0  
1              NaN         NaN       NaN        NaN       NaN     NaN  
2      Peripherals           2   12000.0       0.15   1800.00     NaN  
3  Office Supplies       Paper       3.0     200.00      0.05   150.0  
4        Furniture      Tables       1.0   18000.00      0.20 -1500.0  
5       Technology      Phones       2.0   20000.00      0.00  3000.0 
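
Rows 1 and 2 in the output above are two halves of the CA-1002 record: the CSV text has a line break inside the date, so pandas sees one row with only two fields and a second row whose values are shifted left by one column. The following is one possible repair, sketched under the assumption that no field contains a quoted comma or an intentional newline; the names `raw`, `fixed`, and `repaired` are illustrative:

```python
import io

import pandas as pd

# Header plus three records copied from csv_data; the CA-1002 record
# still contains the stray line break.
raw = """OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
CA-1001,2023-01-15,Ravi,Consumer,South,Laptop,Technology,Computers,1,55000,0.10,5000
CA-1002,2023-02-
20,Priya,Corporate,North,Printer,Technology,Peripherals,2,12000,0.15,1800
CA-1003,2023-01-25,Amit,Consumer,East,Notebook,Office Supplies,Paper,3,200,0.05,150"""

lines = raw.splitlines()
n_fields = lines[0].count(",") + 1  # 12 columns in the header

# Glue a short line onto the following one until the record is complete.
fixed, buffer = [], ""
for line in lines:
    buffer += line
    if buffer.count(",") + 1 >= n_fields:
        fixed.append(buffer)
        buffer = ""

repaired = pd.read_csv(io.StringIO("\n".join(fixed)), parse_dates=["OrderDate"])
print(repaired[["OrderID", "OrderDate", "Customer", "Profit"]])
```

With the record rejoined, the numeric columns no longer need to hold NaN, so Quantity and Profit load as integers and OrderDate parses as a date.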

2. Print schema, head, shape, dtypes.

In [131]:
df.info()  # info() prints its report itself and returns None, so no print() is needed
print(df.head())
print(df.shape)
print(df.dtypes)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   OrderID      6 non-null      object 
 1   OrderDate    6 non-null      object 
 2   Customer     5 non-null      object 
 3   Segment      5 non-null      object 
 4   Region       5 non-null      object 
 5   Product      5 non-null      object 
 6   Category     5 non-null      object 
 7   SubCategory  5 non-null      object 
 8   Quantity     5 non-null      float64
 9   UnitPrice    5 non-null      float64
 10  Discount     5 non-null      float64
 11  Profit       4 non-null      float64
dtypes: float64(4), object(8)
memory usage: 708.0+ bytes
   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate   

3. Select Customer, Product, Profit columns.

In [132]:
print(df[['Customer','Product','Profit']])

    Customer     Product  Profit
0       Ravi      Laptop  5000.0
1        NaN         NaN     NaN
2  Corporate  Technology     NaN
3       Amit    Notebook   150.0
4      Anita       Table -1500.0
5      Divya       Phone  3000.0


4. Filter orders where Profit > 2000 and Discount = 0.

In [133]:
print(df[(df['Profit']>2000)&(df['Discount']==0)])

   OrderID   OrderDate Customer   Segment Region Product    Category  \
5  CA-1005  2023-02-05    Divya  Consumer  South   Phone  Technology   

  SubCategory  Quantity  UnitPrice  Discount  Profit  
5      Phones       2.0    20000.0       0.0  3000.0  


5. Sort by Profit descending.

In [134]:
print(df.sort_values(by='Profit',ascending=False))

   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
5  CA-1005  2023-02-05      Divya     Consumer    South       Phone   
3  CA-1003  2023-01-25       Amit     Consumer     East    Notebook   
4  CA-1004  2023-03-01      Anita  Home Office     West       Table   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate        North  Printer  Technology   

          Category SubCategory  Quantity  UnitPrice  Discount  Profit  
0       Technology   Computers       1.0   55000.00      0.10  5000.0  
5       Technology      Phones       2.0   20000.00      0.00  3000.0  
3  Office Supplies       Paper       3.0     200.00      0.05   150.0  
4        Furniture      Tables       1.0   18000.00      0.20 -1500.0  
1              NaN         NaN       NaN        NaN       NaN     NaN  
2      Peripherals           2   12000.0       0.15   1800.00     NaN 

6. GroupBy Category → Total Profit, Avg Discount.

In [135]:
df1=(df.groupby('Category')[['Profit','Discount']].agg(['sum','mean']))
print(df1)

                 Profit         Discount         
                    sum    mean      sum     mean
Category                                         
Furniture       -1500.0 -1500.0     0.20     0.20
Office Supplies   150.0   150.0     0.05     0.05
Peripherals         0.0     NaN  1800.00  1800.00
Technology       8000.0  4000.0     0.10     0.05
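
The cell above computes both sum and mean for both columns, while the task asks only for total profit and average discount per category. A sketch using pandas named aggregation, which returns exactly one column per requested statistic; the small `orders` frame and the output names `TotalProfit` and `AvgDiscount` are assumptions made for illustration, with values taken from the intact rows of the sample data:

```python
import pandas as pd

# Stand-in frame built from the four intact rows of the sample dataset.
orders = pd.DataFrame({
    "Category": ["Technology", "Office Supplies", "Furniture", "Technology"],
    "Profit": [5000, 150, -1500, 3000],
    "Discount": [0.10, 0.05, 0.20, 0.00],
})

# Named aggregation: output_name=(source_column, aggregation).
summary = orders.groupby("Category").agg(
    TotalProfit=("Profit", "sum"),
    AvgDiscount=("Discount", "mean"),
)
print(summary)
```

This also avoids the two-level column header that agg(['sum','mean']) produces, so the result can be used directly as summary['TotalProfit'].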


7. Add a column TotalPrice = Quantity * UnitPrice.

In [136]:
df['Totalprice']=df['Quantity']*df['UnitPrice']
print(df)

   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate        North  Printer  Technology   
3  CA-1003  2023-01-25       Amit     Consumer     East    Notebook   
4  CA-1004  2023-03-01      Anita  Home Office     West       Table   
5  CA-1005  2023-02-05      Divya     Consumer    South       Phone   

          Category SubCategory  Quantity  UnitPrice  Discount  Profit  \
0       Technology   Computers       1.0   55000.00      0.10  5000.0   
1              NaN         NaN       NaN        NaN       NaN     NaN   
2      Peripherals           2   12000.0       0.15   1800.00     NaN   
3  Office Supplies       Paper       3.0     200.00      0.05   150.0   
4        Furniture      Tables       1.0   18000.00      0.20 -1500.0   
5       Technology      Phones       2.0   20000.00      0.00  3

8. Drop the SubCategory column.

In [137]:
df.drop(columns='SubCategory',inplace=True)
print(df)

   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate        North  Printer  Technology   
3  CA-1003  2023-01-25       Amit     Consumer     East    Notebook   
4  CA-1004  2023-03-01      Anita  Home Office     West       Table   
5  CA-1005  2023-02-05      Divya     Consumer    South       Phone   

          Category  Quantity  UnitPrice  Discount  Profit  Totalprice  
0       Technology       1.0   55000.00      0.10  5000.0     55000.0  
1              NaN       NaN        NaN       NaN     NaN         NaN  
2      Peripherals   12000.0       0.15   1800.00     NaN      1800.0  
3  Office Supplies       3.0     200.00      0.05   150.0       600.0  
4        Furniture       1.0   18000.00      0.20 -1500.0     18000.0  
5       Technology       2.0   20000.00      0.00  3000.0     40000.0 

9. Fill nulls in Discount with 0.10.

In [138]:
df['Discount']=df['Discount'].fillna(0.10)
print(df)

   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate        North  Printer  Technology   
3  CA-1003  2023-01-25       Amit     Consumer     East    Notebook   
4  CA-1004  2023-03-01      Anita  Home Office     West       Table   
5  CA-1005  2023-02-05      Divya     Consumer    South       Phone   

          Category  Quantity  UnitPrice  Discount  Profit  Totalprice  
0       Technology       1.0   55000.00      0.10  5000.0     55000.0  
1              NaN       NaN        NaN      0.10     NaN         NaN  
2      Peripherals   12000.0       0.15   1800.00     NaN      1800.0  
3  Office Supplies       3.0     200.00      0.05   150.0       600.0  
4        Furniture       1.0   18000.00      0.20 -1500.0     18000.0  
5       Technology       2.0   20000.00      0.00  3000.0     40000.0 

10. Apply a function to categorize orders by Profit (in the cell below: High above 4000, Medium above 0, Low otherwise):

In [139]:
def classify(row):
  if row['Profit'] > 4000:
    return 'High'
  elif row['Profit'] > 0:
    return 'Medium'
  else:
    return 'Low'

df['Categorize_Orders']=df.apply(classify,axis=1)
print(df)

   OrderID   OrderDate   Customer      Segment   Region     Product  \
0  CA-1001  2023-01-15       Ravi     Consumer    South      Laptop   
1  CA-1002    2023-02-        NaN          NaN      NaN         NaN   
2       20       Priya  Corporate        North  Printer  Technology   
3  CA-1003  2023-01-25       Amit     Consumer     East    Notebook   
4  CA-1004  2023-03-01      Anita  Home Office     West       Table   
5  CA-1005  2023-02-05      Divya     Consumer    South       Phone   

          Category  Quantity  UnitPrice  Discount  Profit  Totalprice  \
0       Technology       1.0   55000.00      0.10  5000.0     55000.0   
1              NaN       NaN        NaN      0.10     NaN         NaN   
2      Peripherals   12000.0       0.15   1800.00     NaN      1800.0   
3  Office Supplies       3.0     200.00      0.05   150.0       600.0   
4        Furniture       1.0   18000.00      0.20 -1500.0     18000.0   
5       Technology       2.0   20000.00      0.00  3000.0     40
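
One thing to watch in classify: any comparison with NaN is False, so rows with a missing Profit fall through to the else branch and are labelled 'Low'. A vectorized sketch with numpy.select that gives missing values their own label; the 'Unknown' label and the stand-in `profit` series are assumptions, not part of the assignment:

```python
import numpy as np
import pandas as pd

# Stand-in Profit column with one missing value.
profit = pd.Series([5000.0, np.nan, 150.0, -1500.0, 3000.0], name="Profit")

# Conditions are checked in order; the first match wins.
conditions = [profit.isna(), profit > 4000, profit > 0]
labels = ["Unknown", "High", "Medium"]

category = pd.Series(np.select(conditions, labels, default="Low"), index=profit.index)
print(category.tolist())  # ['High', 'Unknown', 'Medium', 'Low', 'Medium']
```

Working on whole columns also avoids the per-row Python call that apply(axis=1) makes, which matters on larger frames.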

PART 2: PySpark DataFrame Operations

1. Load the same CSV using PySpark.

In [140]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName("superstore").getOrCreate()
import pyspark.sql.functions as F

df=spark.read.csv("/content/drive/MyDrive/Colab Notebooks/superstore.csv",header=True,inferSchema=True)
df.show()

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate| Customer|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|
|CA-1002|  2023-02-|     NULL|       NULL|   NULL|      NULL|           NULL|       NULL|    NULL|     NULL|    NULL|  NULL|
|     20|     Priya|Corporate|      North|Printer|Technology|    Peripherals|          2|   12000|     0.15|  1800.0|  NULL|
|CA-1003|2023-01-25|     Amit|   Consumer|   East|  Notebook|Office Supplies|      Paper|       3|    200.0|    0.05|   150|
|CA-1004|2023-03-01|    Anita|Home Office|   West|     Table|      Furniture|     Tables|       1|  18000.0|     0.2| -1500|


2. Show schema and first 5 rows.

In [141]:
df.printSchema()
df.show(5)

root
 |-- OrderID: string (nullable = true)
 |-- OrderDate: string (nullable = true)
 |-- Customer: string (nullable = true)
 |-- Segment: string (nullable = true)
 |-- Region: string (nullable = true)
 |-- Product: string (nullable = true)
 |-- Category: string (nullable = true)
 |-- SubCategory: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- Discount: double (nullable = true)
 |-- Profit: integer (nullable = true)

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate| Customer|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|
|CA-1002|  202

3. Select columns, Rename Customer → Client.

In [142]:
df=df.selectExpr("OrderID","OrderDate","Customer as Client","Segment","Region","Product","Category","SubCategory","Quantity","UnitPrice","Discount","Profit")
df.show()

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate|   Client|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|
|CA-1002|  2023-02-|     NULL|       NULL|   NULL|      NULL|           NULL|       NULL|    NULL|     NULL|    NULL|  NULL|
|     20|     Priya|Corporate|      North|Printer|Technology|    Peripherals|          2|   12000|     0.15|  1800.0|  NULL|
|CA-1003|2023-01-25|     Amit|   Consumer|   East|  Notebook|Office Supplies|      Paper|       3|    200.0|    0.05|   150|
|CA-1004|2023-03-01|    Anita|Home Office|   West|     Table|      Furniture|     Tables|       1|  18000.0|     0.2| -1500|


4. Filter Segment = 'Consumer' and Profit < 1000.

In [143]:
df.filter((F.col('Segment')=='Consumer')&(F.col('Profit')<1000)).show()

+-------+----------+------+--------+------+--------+---------------+-----------+--------+---------+--------+------+
|OrderID| OrderDate|Client| Segment|Region| Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|
+-------+----------+------+--------+------+--------+---------------+-----------+--------+---------+--------+------+
|CA-1003|2023-01-25|  Amit|Consumer|  East|Notebook|Office Supplies|      Paper|       3|    200.0|    0.05|   150|
+-------+----------+------+--------+------+--------+---------------+-----------+--------+---------+--------+------+



5. GroupBy Region and show average profit.

In [144]:
df.groupBy('Region').agg(F.avg('Profit')).show()

+-------+-----------+
| Region|avg(Profit)|
+-------+-----------+
|   NULL|       NULL|
|  South|     4000.0|
|   East|      150.0|
|   West|    -1500.0|
|Printer|       NULL|
+-------+-----------+



6. Use withColumn to create TotalPrice = Quantity * UnitPrice.

In [145]:
df=df.withColumn('TotalPrice',F.col('Quantity')*F.col('UnitPrice'))
df.show()

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+
|OrderID| OrderDate|   Client|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|TotalPrice|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|   55000.0|
|CA-1002|  2023-02-|     NULL|       NULL|   NULL|      NULL|           NULL|       NULL|    NULL|     NULL|    NULL|  NULL|      NULL|
|     20|     Priya|Corporate|      North|Printer|Technology|    Peripherals|          2|   12000|     0.15|  1800.0|  NULL|    1800.0|
|CA-1003|2023-01-25|     Amit|   Consumer|   East|  Notebook|Office Supplies|      Paper|       3|    200.0|    0.05|   150|     600.0|
|CA-1004|2023-03-01|    Anita|Home Office|   Wes

7. Use when().otherwise() to classify Profit (in the cell below: High above 2000, Low at 0 or below, Medium otherwise):

In [146]:
df=df.withColumn('Categorize_Orders',F.when(F.col('Profit')>2000,'High').when(F.col('Profit')<=0,'Low').otherwise('Medium'))
df.show()

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+-----------------+
|OrderID| OrderDate|   Client|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|TotalPrice|Categorize_Orders|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+-----------------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|   55000.0|             High|
|CA-1002|  2023-02-|     NULL|       NULL|   NULL|      NULL|           NULL|       NULL|    NULL|     NULL|    NULL|  NULL|      NULL|           Medium|
|     20|     Priya|Corporate|      North|Printer|Technology|    Peripherals|          2|   12000|     0.15|  1800.0|  NULL|    1800.0|           Medium|
|CA-1003|2023-01-25|     Amit|   Consumer|   East|  Notebook|Office Supplies

8. Use drop() to remove SubCategory.

In [147]:
df.drop('SubCategory').show()

+-------+----------+---------+-----------+-------+----------+---------------+--------+---------+--------+------+----------+-----------------+
|OrderID| OrderDate|   Client|    Segment| Region|   Product|       Category|Quantity|UnitPrice|Discount|Profit|TotalPrice|Categorize_Orders|
+-------+----------+---------+-----------+-------+----------+---------------+--------+---------+--------+------+----------+-----------------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|       1|  55000.0|     0.1|  5000|   55000.0|             High|
|CA-1002|  2023-02-|     NULL|       NULL|   NULL|      NULL|           NULL|    NULL|     NULL|    NULL|  NULL|      NULL|           Medium|
|     20|     Priya|Corporate|      North|Printer|Technology|    Peripherals|   12000|     0.15|  1800.0|  NULL|    1800.0|           Medium|
|CA-1003|2023-01-25|     Amit|   Consumer|   East|  Notebook|Office Supplies|       3|    200.0|    0.05|   150|     600.0|           Medium|
|CA-10

9. Handle nulls in Discount using fillna(0.10).

In [148]:
df=df.fillna({'Discount':0.10})
df.show()

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+-----------------+
|OrderID| OrderDate|   Client|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|TotalPrice|Categorize_Orders|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+-----------------+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|   55000.0|             High|
|CA-1002|  2023-02-|     NULL|       NULL|   NULL|      NULL|           NULL|       NULL|    NULL|     NULL|     0.1|  NULL|      NULL|           Medium|
|     20|     Priya|Corporate|      North|Printer|Technology|    Peripherals|          2|   12000|     0.15|  1800.0|  NULL|    1800.0|           Medium|
|CA-1003|2023-01-25|     Amit|   Consumer|   East|  Notebook|Office Supplies

10. Convert OrderDate to date type and extract year, month.

In [149]:
df=df.withColumn('OrderDate',F.col('OrderDate').cast('date'))
df.withColumn('Year',F.year('OrderDate')).show()
df.withColumn('Month',F.month('OrderDate')).show()
df.show()

+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+-----------------+----+
|OrderID| OrderDate|   Client|    Segment| Region|   Product|       Category|SubCategory|Quantity|UnitPrice|Discount|Profit|TotalPrice|Categorize_Orders|Year|
+-------+----------+---------+-----------+-------+----------+---------------+-----------+--------+---------+--------+------+----------+-----------------+----+
|CA-1001|2023-01-15|     Ravi|   Consumer|  South|    Laptop|     Technology|  Computers|       1|  55000.0|     0.1|  5000|   55000.0|             High|2023|
|CA-1002|      NULL|     NULL|       NULL|   NULL|      NULL|           NULL|       NULL|    NULL|     NULL|     0.1|  NULL|      NULL|           Medium|NULL|
|     20|      NULL|Corporate|      North|Printer|Technology|    Peripherals|          2|   12000|     0.15|  1800.0|  NULL|    1800.0|           Medium|NULL|
|CA-1003|2023-01-25|     Amit|   Consumer|   E

PART 3: Dask DataFrame Operations (Pandas Alternative)

1. Install Dask:

!pip install dask

In [150]:
!pip install dask



2. Load the same superstore.csv:

In [151]:
import dask.dataframe as dd
df=dd.read_csv("/content/drive/MyDrive/Colab Notebooks/superstore.csv")
df.compute()

Unnamed: 0,OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
0,CA-1001,2023-01-15,Ravi,Consumer,South,Laptop,Technology,Computers,1.0,55000.0,0.1,5000.0
1,CA-1002,2023-02-,,,,,,,,,,
2,20,Priya,Corporate,North,Printer,Technology,Peripherals,2,12000.0,0.15,1800.0,
3,CA-1003,2023-01-25,Amit,Consumer,East,Notebook,Office Supplies,Paper,3.0,200.0,0.05,150.0
4,CA-1004,2023-03-01,Anita,Home Office,West,Table,Furniture,Tables,1.0,18000.0,0.2,-1500.0
5,CA-1005,2023-02-05,Divya,Consumer,South,Phone,Technology,Phones,2.0,20000.0,0.0,3000.0


3. Do the following:

Compute average discount by category.

In [152]:
df.groupby('Category')['Discount'].mean().compute()

Category,Discount
Furniture,0.2
Office Supplies,0.05
Peripherals,1800.0
Technology,0.05


Filter orders with a quantity greater than 1 and high profit (here, Profit > 2000).

In [153]:
df[(df['Quantity']>1)&(df['Profit']>2000)].compute()

Unnamed: 0,OrderID,OrderDate,Customer,Segment,Region,Product,Category,SubCategory,Quantity,UnitPrice,Discount,Profit
5,CA-1005,2023-02-05,Divya,Consumer,South,Phone,Technology,Phones,2.0,20000.0,0.0,3000.0


Save filtered data to new CSV.

In [154]:
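filtered=df[(df['Quantity']>1)&(df['Profit']>2000)]  # save the filtered rows, not the full frame
filtered.to_csv('/content/drive/MyDrive/Colab Notebooks/superstore_filtered.csv',index=False)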

['/content/drive/MyDrive/Colab Notebooks/superstore_filtered.csv/0.part']

PART 4: JSON Handling (Complex Nested)

1. Create a nested JSON file:

In [155]:
import json
data=[
{
"OrderID": "CA-1001",
"Customer": {"Name": "Ravi", "Segment": "Consumer"},
"Details": {"Region": "South", "Profit": 5000}
},
{
"OrderID": "CA-1002",
"Customer": {"Name": "Priya", "Segment": "Corporate"},
"Details": {"Region": "North", "Profit": 1800}
}
]

with open("orders.json","w") as file:
  json.dump(data,file,indent=4)

2. Load it using PySpark:

In [156]:
df_json = spark.read.json('orders.json', multiLine=True)
df_json.printSchema()
df_json.select("OrderID", "Customer.Name", "Details.Profit").show()

root
 |-- Customer: struct (nullable = true)
 |    |-- Name: string (nullable = true)
 |    |-- Segment: string (nullable = true)
 |-- Details: struct (nullable = true)
 |    |-- Profit: long (nullable = true)
 |    |-- Region: string (nullable = true)
 |-- OrderID: string (nullable = true)

+-------+-----+------+
|OrderID| Name|Profit|
+-------+-----+------+
|CA-1001| Ravi|  5000|
|CA-1002|Priya|  1800|
+-------+-----+------+
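
As a cross-check outside Spark, the same nested records can be flattened with pandas.json_normalize, which turns each nested key into a dotted column name such as Customer.Name. This is only a sketch of an alternative view of the data, not part of the assignment; the names `data` and `flat` are illustrative:

```python
import pandas as pd

# Same records as written to orders.json above.
data = [
    {
        "OrderID": "CA-1001",
        "Customer": {"Name": "Ravi", "Segment": "Consumer"},
        "Details": {"Region": "South", "Profit": 5000},
    },
    {
        "OrderID": "CA-1002",
        "Customer": {"Name": "Priya", "Segment": "Corporate"},
        "Details": {"Region": "North", "Profit": 1800},
    },
]

# Nested dictionaries become columns named "<parent>.<child>".
flat = pd.json_normalize(data)
print(flat[["OrderID", "Customer.Name", "Details.Profit"]])
```

The three selected columns correspond to the OrderID, Customer.Name, and Details.Profit fields selected in the PySpark cell.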

