### *READING AND WRITING DIFFERENT FILE FORMATS*
Databricks can read and write data in a variety of formats, such as CSV, JSON, Parquet, Avro, and XLSX (Excel). In this notebook we are going to see how to read and write these formats.

### *CSV FILE FORMAT (READ)*
Comma-separated values (CSV) is a text file format that uses commas to separate values. A CSV file stores tabular data (numbers and text) in plain text, where each line of the file typically represents one data record. Each record consists of the same number of fields, and these are separated by commas in the CSV file.
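
As a rough illustration of that structure, the sketch below parses a few lines with Python's standard `csv` module, independent of Spark. The inline sample text is an assumption made for the example, shaped like the file used in this notebook; it is not read from the actual file.

```python
import csv
import io

# A small inline sample shaped like us_counties.csv (illustrative only).
sample_text = """date,state,county,fips,cases,deaths
2020-01-21,Washington,Snohomish,53061,1,0
2020-01-24,Illinois,Cook,17031,1,0
2020-01-26,California,Los Angeles,6037,1,0
"""

reader = csv.reader(io.StringIO(sample_text))
field_names = next(reader)  # the first line holds the field names
records = list(reader)      # every remaining line is one record

# Every record should have the same number of fields as the header.
field_counts = {len(row) for row in records}
print(field_names)
print(field_counts)
```

Every value comes back as a string here; assigning column types is the job of options such as inferSchema in the Spark reader.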

In [0]:
# file location and type
file_location = "/FileStore/tables/us_counties.csv"
file_type = "csv"

# file options
infer_schema = "true"
header = "true"
delimeter = ","

# read the CSV file into a DataFrame
csv_df = (
    spark.read.format(file_type)
    .option("header", header)
    .option("inferSchema", infer_schema)
    .option("sep", delimeter)
    .load(file_location)
)

csv_df.limit(10).display()
csv_df.printSchema()

date,state,county,fips,cases,deaths
2020-01-21,Washington,Snohomish,53061,1,0
2020-01-22,Washington,Snohomish,53061,1,0
2020-01-23,Washington,Snohomish,53061,1,0
2020-01-24,Illinois,Cook,17031,1,0
2020-01-24,Washington,Snohomish,53061,1,0
2020-01-25,California,Orange,6059,1,0
2020-01-25,Illinois,Cook,17031,1,0
2020-01-25,Washington,Snohomish,53061,1,0
2020-01-26,Arizona,Maricopa,4013,1,0
2020-01-26,California,Los Angeles,6037,1,0


root
 |-- date: date (nullable = true)
 |-- state: string (nullable = true)
 |-- county: string (nullable = true)
 |-- fips: integer (nullable = true)
 |-- cases: integer (nullable = true)
 |-- deaths: integer (nullable = true)
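
Schema inference needs an extra pass over the file, so declaring the schema up front is a common alternative. The sketch below assumes the column types printed above and passes them as a DDL string. The function name `read_us_counties` and the constant `US_COUNTIES_SCHEMA` are hypothetical, and `spark` is the session that a Databricks notebook provides.

```python
# Column names and types copied from the printSchema() output above.
US_COUNTIES_SCHEMA = (
    "date DATE, state STRING, county STRING, "
    "fips INT, cases INT, deaths INT"
)


def read_us_counties(spark, path="/FileStore/tables/us_counties.csv"):
    """Read the CSV with a declared schema instead of inferSchema."""
    return (
        spark.read.format("csv")
        .option("header", "true")    # treat the first line as column names
        .option("sep", ",")
        .schema(US_COUNTIES_SCHEMA)  # DataFrameReader.schema accepts a DDL string
        .load(path)
    )
```

In a notebook cell this could be called as `read_us_counties(spark).printSchema()`. If the dates in the file are not in ISO format, a `dateFormat` option would probably be needed as well.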



To see how the CSV file is actually stored, load it with the text file format, which returns each line of the file as a single string.

In [0]:
text_df = spark.read.format("text").load("/FileStore/tables/us_counties.csv")
text_df.limit(10).display()

value
"date,state,county,fips,cases,deaths"
"21-01-2020,Washington,Snohomish,53061,1,0"
"22-01-2020,Washington,Snohomish,53061,1,0"
"23-01-2020,Washington,Snohomish,53061,1,0"
"24-01-2020,Illinois,Cook,17031,1,0"
"24-01-2020,Washington,Snohomish,53061,1,0"
"25-01-2020,California,Orange,6059,1,0"
"25-01-2020,Illinois,Cook,17031,1,0"
"25-01-2020,Washington,Snohomish,53061,1,0"
"26-01-2020,Arizona,Maricopa,4013,1,0"


### *CSV FILE FORMAT (WRITE)*
Using the DataFrame's write interface (a DataFrameWriter) we can save the data to a different location.

Save modes available on the writer:

-> append   : Append the contents of this DataFrame to the existing data.

-> overwrite: Overwrite the existing data.

-> ignore   : Silently skip the write if data already exists.

-> error    : Throw an exception if data already exists (the default; also spelled errorifexists).
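
A minimal sketch of choosing a save mode explicitly through `DataFrameWriter.mode`. The helper name `write_csv` and its validation step are assumptions for illustration, not part of the Spark API; `df` stands for any DataFrame, such as `csv_df`.

```python
# The save modes Spark accepts; "errorifexists" is an alias for "error".
VALID_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_csv(df, path, mode="error"):
    """Write df as CSV to path using an explicit save mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    (
        df.write.mode(mode)        # behaviour when data already exists at path
        .option("header", "true")  # write the column names as the first line
        .csv(path)
    )
```

For example, `write_csv(csv_df, "/mnt/FileStore", mode="append")` would add the rows to whatever is already stored at that path, while the default `error` mode would refuse to write there.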


In [0]:
csv_df.write.csv(path="/mnt/FileStore", header="true", mode="overwrite")

In [0]:
df1 = (
    spark.read.format(file_type)
    .option("header", header)
    .option("inferSchema", infer_schema)
    .option("sep", delimeter)
    .load("/mnt/FileStore")
)

df1.limit(10).display()

date,state,county,fips,cases,deaths
2020-01-21,Washington,Snohomish,53061,1,0
2020-01-22,Washington,Snohomish,53061,1,0
2020-01-23,Washington,Snohomish,53061,1,0
2020-01-24,Illinois,Cook,17031,1,0
2020-01-24,Washington,Snohomish,53061,1,0
2020-01-25,California,Orange,6059,1,0
2020-01-25,Illinois,Cook,17031,1,0
2020-01-25,Washington,Snohomish,53061,1,0
2020-01-26,Arizona,Maricopa,4013,1,0
2020-01-26,California,Los Angeles,6037,1,0


### *JSON FILE FORMAT*
JavaScript Object Notation (JSON) is a standard text-based format for representing structured data based on JavaScript object syntax. It is commonly used for transmitting data in web applications (e.g., sending some data from the server to the client, so it can be displayed on a web page, or vice versa).

In [0]:
json_df = spark.read.json("/FileStore/tables/test_file-1.json", multiLine=True)
display(json_df)

code,name,rank
US,United States,1.0
CA,Canada,10.0
GB,United Kingdom,6.0
AU,Australia,13.0
DE,Germany,4.0
BR,Brazil,
IT,Italy,
NL,Netherlands,19.0
SE,Sweden,14.0
NO,Norway,9.0
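
The read above passes `multiLine=True` because the file holds a single JSON array spread over many lines; without that option, Spark's JSON reader expects one complete JSON object per line (JSON Lines). The sketch below uses Python's standard `json` module on an inline sample to show the two layouts. The sample text, including the record without a rank, is an assumption for the example and is not the real file.

```python
import json

# Inline sample: one JSON array spanning several lines (illustrative only).
multiline_text = """[
  {"name": "United States", "code": "US", "rank": 1},
  {"name": "Canada", "code": "CA", "rank": 10},
  {"name": "Brazil", "code": "BR"}
]"""

# The whole text is a single JSON value, so it is parsed in one call.
country_records = json.loads(multiline_text)

# JSON Lines layout: each record serialized on its own line.
json_lines = "\n".join(json.dumps(rec) for rec in country_records)
print(json_lines)

# JSON Lines can be parsed independently, one line at a time.
parsed_back = [json.loads(line) for line in json_lines.splitlines()]
```

Because every line of a JSON Lines file is a complete record, such files can be split and read in parallel, whereas a multi-line document has to be parsed as a whole.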


To view the original JSON file, load the data using the text format.

In [0]:
# use a separate name so the parsed json_df above is not overwritten
json_text_df = spark.read.text("/FileStore/tables/test_file-1.json")
json_text_df.limit(10).display()

value
[
{
"""name"": ""United States"","
"""code"": ""US"","
"""rank"": 1"
"},"
{
"""name"": ""Canada"","
"""code"": ""CA"","
"""rank"": 10"


### *PARQUET FILE FORMAT*
Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient compression and encoding schemes and performs well when handling complex data in bulk.
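
To make "column-oriented" concrete, here is a toy sketch in plain Python. It is not Parquet's actual on-disk layout; it only shows why storing the values of one column together helps encodings such as run-length encoding, one of the schemes Parquet uses. The sample rows and the helper `run_length_encode` are made up for illustration.

```python
from itertools import groupby

# Row-oriented view: one tuple per record (state, cases).
rows = [
    ("Washington", 1),
    ("Washington", 1),
    ("Washington", 1),
    ("Illinois", 1),
    ("Illinois", 2),
]

# Column-oriented view: all values of one column stored together.
states, cases = (list(column) for column in zip(*rows))


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


print(run_length_encode(states))  # [('Washington', 3), ('Illinois', 2)]
print(run_length_encode(cases))   # [(1, 4), (2, 1)]
```

Values within a column tend to repeat or resemble one another, so they compress far better side by side than when interleaved with the other fields of each row.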

In [0]:
parquet_df = spark.read.format("parquet").load("/FileStore/tables/Titanic.parquet")
parquet_df.limit(10).display()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [0]:
# overwrite so that re-running this cell does not fail on an existing path
csv_df.write.parquet(path="/mnt/FileStore/sample", mode="overwrite")

In [0]:
par = spark.read.format("parquet").load("/mnt/FileStore/sample")
par.limit(10).display()

date,state,county,fips,cases,deaths
2020-01-21,Washington,Snohomish,53061,1,0
2020-01-22,Washington,Snohomish,53061,1,0
2020-01-23,Washington,Snohomish,53061,1,0
2020-01-24,Illinois,Cook,17031,1,0
2020-01-24,Washington,Snohomish,53061,1,0
2020-01-25,California,Orange,6059,1,0
2020-01-25,Illinois,Cook,17031,1,0
2020-01-25,Washington,Snohomish,53061,1,0
2020-01-26,Arizona,Maricopa,4013,1,0
2020-01-26,California,Los Angeles,6037,1,0


### *AVRO FILE FORMAT*
Avro stores the data definition (the schema) in JSON format, making it easy to read and interpret; the data itself is stored in a binary format, making it compact and efficient. Avro files include sync markers that can be used to split large data sets into subsets suitable for Apache Hadoop MapReduce processing.

In [0]:
avro_df = spark.read.format("avro").load("/FileStore/tables/twitter.avro")
display(avro_df)

username,tweet,timestamp
miguno,"Rock: Nerf paper, scissors is fine.",1366150681
BlizzardCS,Works as intended. Terran is IMBA.,1366154481
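
An Avro data definition is itself a JSON document. The sketch below builds one with Python's standard `json` module for records shaped like the output above. The record name `Tweet` and the field types are assumptions for illustration; the schema actually embedded in `twitter.avro` may differ.

```python
import json

# Hypothetical Avro schema for records with the three columns shown above.
tweet_schema = {
    "type": "record",
    "name": "Tweet",  # assumed record name
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},  # assumed to be epoch seconds
    ],
}

# Serialize the definition to the JSON text an Avro file would carry.
schema_json = json.dumps(tweet_schema, indent=2)
print(schema_json)
```

Since the definition travels with the data, a reader can decode the binary records without any external description of the fields.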
