### Exploratory Data Analysis - Extract Data
In this notebook we will practice extracting data into a Spark dataframe, then storing that data in a Python/pandas dataframe and in a Databricks table for future use. Having the data in a pandas dataframe will allow us to perform exploratory data analysis and visualization with Python in future notebooks.

We will extract data from two common file types:
- CSV
- JSON

#### Extract data from CSV file

In [0]:
# Read the CSV file from DBFS into a Spark dataframe

"""
1. Setting the header option to True makes Spark treat the first row as the header of the dataframe. Without it,
the columns are named _c0, _c1, and so on.
2. Setting the inferSchema option to True makes Spark infer the data type of each column from the
data. Without it, every column is read as a string.
"""

df1 = spark.read.format("csv") \
    .option("header", True) \
    .option("inferSchema", True) \
    .load("dbfs:/FileStore/DataFiles/SuperstoreOrders.csv")
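To make the effect of `inferSchema` more concrete, here is a toy sketch in plain Python of what type inference over string columns can look like. It is not Spark's actual algorithm: the helper `infer_type` is hypothetical and only distinguishes int, float, and string, whereas Spark handles more types. The sample values are taken from the Row ID, Sales, and Sub-Category columns shown in the output further down.

```python
def infer_type(values):
    """Return the narrowest of 'int', 'float', 'string' that fits every value.

    Hypothetical helper for illustration only; not part of Spark.
    """
    def fits(cast):
        # A cast "fits" when it succeeds for every value in the column.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    for name, cast in (("int", int), ("float", float)):
        if fits(cast):
            return name
    return "string"


# Every CSV field starts out as text, which is what you get without inference.
rows = [
    ["2", "3.54", "Binders"],
    ["3", "11.784", "Labels"],
]

# Transpose rows into columns and infer one type per column.
columns = list(zip(*rows))
inferred = [infer_type(column) for column in columns]
print(inferred)
```

The point of the sketch is that inference has to look at the data itself, which is why enabling it costs an extra read of the file compared with treating everything as a string.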

In [0]:
# Display the first 5 rows of the Spark dataframe read from the CSV file
display(df1.limit(5))

Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country/Region,City,State/Province,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
1,US-2020-103800,2020-01-03,2020-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,Texas,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"""Message Book, Wirebound, Four 5 1/2"""" X 4"""" Forms/Pg.","200 Dupl. Sets/Book""",16.448,2.0,0.2
2,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,2.0,0.8,-5.487
3,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3.0,0.2,4.2717
4,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3.0,0.2,-64.7748
5,US-2020-141817,2020-01-05,2020-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,Pennsylvania,19143,East,OFF-AR-10003478,Office Supplies,Art,"Avery Hi-Liter EverBold Pen Style Fluorescent Highlighters, 4/Pack",19.536,3.0,0.2,4.884
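In row 1 of the output above, the values after Product Name appear shifted one column to the right: part of the product name sits under Sales. One plausible cause, stated here as an assumption because the raw file is not shown, is that the file escapes inner double quotes by doubling them (`""`), while Spark's CSV reader uses a backslash as its default escape character. The sketch below uses Python's standard `csv` module, not Spark, on a hypothetical shortened line modeled on that row to show how a doubled-quote field is meant to parse.

```python
import csv
import io

# Hypothetical raw line (an assumption, modeled on row 1): the middle field
# contains a comma and inch marks, so it is wrapped in quotes and each inner
# quote is doubled.
raw = '1,"Message Book, Four 5 1/2"" X 4"" Forms/Pg.",16.448\n'

# csv.reader treats "" inside a quoted field as one literal quote by default.
fields = next(csv.reader(io.StringIO(raw)))
print(len(fields))
print(fields[1])

# Splitting on commas without honoring the quotes breaks the field apart.
naive = raw.strip().split(",")
print(len(naive))
```

In Spark, the corresponding reader setting would be `.option("escape", "\"")` next to the existing `header` and `inferSchema` options. Whether that resolves the shift for this particular file would need to be checked against the raw data.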


#### Extract data from JSON file

In [0]:
# Read the JSON file from DBFS into a Spark dataframe

df2 = spark.read.format("json") \
    .load("dbfs:/FileStore/DataFiles/SuperstoreRegions.json")

In [0]:
# Display the content of the spark dataframe read from the JSON file
display(df2)

Region,RegionalManager
West,Sadie Pawthorne
East,Chuck Magee
Central,Roxanne Rodriguez
South,Fred Suzuki
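By default, Spark's JSON reader expects JSON Lines input, meaning one complete JSON object per line. A file that holds a single array or object spread over several lines needs the reader's `multiLine` option set to true. The layout of `SuperstoreRegions.json` is not shown here, so the content below is a hypothetical reconstruction from the first two rows of the output above. The sketch uses Python's standard `json` module, not Spark, to show what line-by-line parsing means.

```python
import json

# Hypothetical file content in JSON Lines form (an assumption about the layout):
# each line is a self-contained JSON object.
json_lines = """\
{"Region": "West", "RegionalManager": "Sadie Pawthorne"}
{"Region": "East", "RegionalManager": "Chuck Magee"}
"""

# Parse each non-empty line on its own; every line becomes one record (row).
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

regions = [record["Region"] for record in records]
print(regions)
```

Each parsed object corresponds to one row of the dataframe, and its keys become the column names.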


#### Spark dataframe to pandas dataframe

In [0]:
# Check the type of the original dataframe, which is a Spark SQL dataframe
print(type(df1))

# Convert both Spark dataframes to pandas dataframes.
# toPandas() collects all rows to the driver, so the data must fit in driver memory.
pdf1 = df1.toPandas()
pdf2 = df2.toPandas()

# Check the type of the new dataframe, which is a pandas dataframe
print(type(pdf1))

<class 'pyspark.sql.dataframe.DataFrame'>
<class 'pandas.core.frame.DataFrame'>


In [0]:
# Display the first 5 records of the pandas dataframe
pdf1.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country/Region,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,1,US-2020-103800,2020-01-03,2020-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,...,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"""Message Book, Wirebound, Four 5 1/2"""" X 4"""" F...","200 Dupl. Sets/Book""",16.448,2.0,0.2
1,2,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,2.0,0.8,-5.487
2,3,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3.0,0.2,4.2717
3,4,US-2020-112326,2020-01-04,2020-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3.0,0.2,-64.7748
4,5,US-2020-141817,2020-01-05,2020-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,...,19143,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,3.0,0.2,4.884


In [0]:
# Display the first records of the second pandas dataframe (head() returns at most 5 rows)
pdf2.head()

Unnamed: 0,Region,RegionalManager
0,West,Sadie Pawthorne
1,East,Chuck Magee
2,Central,Roxanne Rodriguez
3,South,Fred Suzuki
