<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>

# READ & STORE DATA

## READER

In PySpark, the reader is the interface you use to read data from external sources (like CSV, JSON, Parquet files, or databases) and load it into a DataFrame.

Syntax:
```python
spark.read.format("<source>").option("<key>", "<value>").load("<path>")
```
References: 
* [pyspark.sql.DataFrameReader](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrameReader.html)
* [Data Source options](https://spark.apache.org/docs/latest/sql-data-sources.html)
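
As a minimal sketch of how that chain fits together, the helper below wraps the three steps (format, options, load) in one function. The name `read_source` and its parameters are hypothetical, not part of PySpark; `spark` is assumed to be an active `SparkSession`, as provided in a Databricks notebook.

```python
def read_source(spark, source_format, path, **options):
    """Load `path` with spark.read, choosing the format and options explicitly.

    `read_source` is a hypothetical helper; `spark` is assumed to be an
    existing SparkSession.
    """
    reader = spark.read.format(source_format)   # e.g. "csv", "json", "parquet"
    for key, value in options.items():
        reader = reader.option(key, value)      # one source-specific option at a time
    return reader.load(path)                    # returns a DataFrame
```

A call such as `read_source(spark, "csv", "/path/to/file.csv", header=True, sep=",")` would then read a CSV file with a header row; the path here is only a placeholder.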



### GET DATA

#### FROM INTERNET

In [0]:
from pyspark import SparkFiles

# download a file from a public URL (this approach is for internet sources only)
url: str = "https://gitlab.com/luisvasv/public/-/raw/master/datasets/004.mock.data/001.dependents.csv"

# add to workspace
spark.sparkContext.addFile(url)

spark_uri: str = "file://"+SparkFiles.get("001.dependents.csv")
print(spark_uri)

In [0]:
%sh cat /local_disk0/spark-9163714a-6126-4e94-bc0e-e69388393d74/userFiles-82079eb8-83a8-4379-a9f2-6e2f076185d4/001.dependents.csv | head -n2
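
The directory in the shell command above is generated per cluster session, so it will differ on every run. As a hedged alternative, the sketch below previews the first lines using the URI itself. `head_lines` is a hypothetical helper, and it assumes the file lives on the driver's local filesystem and is UTF-8 text.

```python
from urllib.parse import urlparse


def head_lines(uri, n=2):
    """Return the first `n` lines of a local text file given a path or file:// URI."""
    # Strip the file:// scheme if present; plain paths are used unchanged.
    path = urlparse(uri).path if uri.startswith("file://") else uri
    with open(path, encoding="utf-8") as handle:
        # zip stops after n lines, so large files are not read in full.
        return [line.rstrip("\n") for _, line in zip(range(n), handle)]
```

In this notebook it could be called as `head_lines(spark_uri)`.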

#### FROM DBFS

1. Go to this URL: https://gitlab.com/luisvasv/public/-/blob/master/datasets/004.mock.data/001.dependents.csv?ref_type=heads
2. Download the file.
3. In the workspace, go to Catalog > Add data > Upload file, then copy the resulting file path.

In [0]:
spark_uri: str = "/FileStore/tables/001_dependents.csv"

### READ CSV AS REFERENCE

In **PySpark**, the reader (`spark.read`) follows the same standard interface for all data sources: `.format("name")` specifies the source type (such as `"csv"`, `"parquet"`, `"json"`, or `"jdbc"`), and `.option()` sets the parameters that are valid for that specific source.
Alternatively, for popular formats such as Parquet, CSV, and JSON, PySpark also provides direct methods (`spark.read.parquet()`, `spark.read.csv()`, and `spark.read.json()`), which internally use the same `.format()` mechanism.

All CSV options are listed [here](https://spark.apache.org/docs/latest/sql-data-sources-csv.html).

#### WAY 1 - SPECIFIC SOURCE AND PARAMS AS **KWARGS

In [0]:
spark.read.csv(spark_uri, header=True, sep=",").display()

#### WAY 2 - SPECIFIC SOURCE AND PARAMS WITH OPTIONS METHOD

In [0]:

spark.read.option("sep", ",").option("header", True).csv(spark_uri).display()

#### WAY 3 - USING FORMAT METHOD

In [0]:
spark.read.format("csv").option("sep", ",").option("header", True).load(spark_uri).display()

#### WAY 4 - USING DICT

In [0]:
params = {
    "sep": ",",
    "header": True,
    "inferSchema": True
}

spark.read.format("csv").options(**params).load(spark_uri).display()

### SET SCHEMA

#### WAY 1 - INFER SCHEMA OPTION

In [0]:
displayHTML("""
<div style="display: flex; align-items: center; background-color: #fdecea; border: 1px solid #f5c6cb; padding: 10px 14px; border-radius: 6px; font-family: Arial, sans-serif; color: #a94442; font-size: 15px; max-width: 600px;">
  <span style="font-size: 18px; margin-right: 10px;">❌</span>
  <strong>Bad Practice</strong>
</div>
""")

In [0]:
spark.read.option("sep", ",").option("header", True).option("inferSchema", True).csv(spark_uri).display()
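
One reason this is flagged as a bad practice: with `inferSchema`, Spark needs an extra pass over the data, and the guessed types can drift from what the pipeline expects. A hedged sketch of a guard follows. `schema_mismatches` and the expected types are hypothetical; the function takes the list of `(column, type)` pairs that `DataFrame.dtypes` returns.

```python
def schema_mismatches(dtypes, expected):
    """Compare (column, type) pairs against an expected {column: type} mapping.

    Returns {column: (expected_type, actual_type)} for every disagreement;
    actual_type is None when the column is missing.
    """
    actual = dict(dtypes)
    return {
        column: (wanted, actual.get(column))
        for column, wanted in expected.items()
        if actual.get(column) != wanted
    }


# Hypothetical expectation for the dependents file used in this notebook.
expected_types = {"first_name": "string", "age": "int"}

# With a real DataFrame this would be: schema_mismatches(df.dtypes, expected_types)
print(schema_mismatches([("first_name", "string"), ("age", "string")], expected_types))
# {'age': ('int', 'string')}
```

An empty result means the inferred types match the expectation.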


#### WAY 2 - DEFINING STRUCT AND FIELD TYPE SCHEMA

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
    StructField("first_name", StringType(), True),
    StructField("last_name", StringType(), True),
    StructField("email_dependents", StringType(), True),
    StructField("age", IntegerType(), True)
])

spark.read.csv(spark_uri, header=True, sep=",", schema=schema).display()

#### WAY 3 - READ SCHEMA

In [0]:
spark.read.schema(schema).format("csv").option("sep", ",").option("header", True).load(spark_uri).display()

#### WAY 4 - SCHEMA TEXT WITH SQL FORMAT

In [0]:
sql_ddl = "first_name STRING, last_name STRING, email_dependents STRING, age INT"
spark.read.schema(sql_ddl).format("csv").option("sep", ",").option("header", True).load(spark_uri).display()
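
When the column list lives in configuration, the DDL string can be generated rather than typed by hand. A minimal sketch, assuming a plain dict of column names to Spark SQL type names; `to_ddl` is a hypothetical helper, not a PySpark function.

```python
def to_ddl(columns):
    """Join {column_name: sql_type} pairs into a DDL schema string."""
    return ", ".join(f"{name} {sql_type}" for name, sql_type in columns.items())


columns = {
    "first_name": "STRING",
    "last_name": "STRING",
    "email_dependents": "STRING",
    "age": "INT",
}

sql_ddl = to_ddl(columns)
print(sql_ddl)
# first_name STRING, last_name STRING, email_dependents STRING, age INT
```

The resulting string can be passed to `spark.read.schema(...)` in the same way as the literal above.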

###  FORMATS WITH EMBEDDED SCHEMA

#### PARQUET (BEST AND DEFAULT)

Columnar storage, highly efficient, widely used in data lakes and big data pipelines.

[reference](https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)

In [0]:
spark.read.parquet("file:///tmp/dbcert.parquet").display()

#### AVRO


Row-based storage, supports schema evolution and is common in data streaming and Kafka.

[reference](https://spark.apache.org/docs/latest/sql-data-sources-avro.html)

#### ORC

Columnar format optimized for Hive, great for compression and read performance.

[reference](https://spark.apache.org/docs/latest/sql-data-sources-orc.html)

## WRITER
In PySpark, the **writer** (`DataFrame.write`) follows the same standard interface for all data sinks. 
You always use **`.format("name")`** to specify the type (like `"csv"`, `"parquet"`, `"json"`, `"jdbc"`), 
and **`.option()`** to set the valid parameters for that specific sink.

Alternatively, for some popular formats like **Parquet**, **CSV**, and **JSON**, PySpark also provides direct methods like 
**`df.write.parquet()`**, **`df.write.csv()`**, and **`df.write.json()`**, which internally use the same **`.format()`** mechanism.
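
A minimal sketch of that writer chain, wrapped in a function so each step is visible. `write_sink` and its parameters are hypothetical names; `df` is assumed to be any Spark DataFrame.

```python
def write_sink(df, sink_format, path, mode="errorifexists", **options):
    """Write `df` to `path` through the generic DataFrameWriter interface."""
    (
        df.write
        .format(sink_format)     # e.g. "csv", "parquet", "json"
        .mode(mode)              # behavior when the target already exists
        .options(**options)      # sink-specific options such as header or sep
        .save(path)
    )
```

For instance, `write_sink(df, "csv", "file:///tmp/example_csv", mode="overwrite", header=True)` would write a CSV directory with a header row; the target path is only a placeholder.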

In [0]:
df_write = spark.read.option("sep", ",").option("header", True).option("inferSchema", True).csv(spark_uri).select(
    "first_name",
    "last_name"
)
df_write.display()

### LOCATIONS

#### LOCAL

You can save data locally on the Spark driver node by using the `file:///` prefix. 
For example, a local target path for a DataFrame saved in Parquet format:

```python
path = "file:///tmp/mydata.parquet"
```

#### DBFS

If you're working in Databricks, you can save files into the FileStore, a special location whose files can be accessed via a URL (similar to HDFS in Hadoop).
```python
path = "/FileStore/labs/mydata_csv"
```

### SAVE CSV AS REFERENCE

#### WAY 1 - SPECIFIC SOURCE AND PARAMS AS **KWARGS

In [0]:
df_write.write.csv("file:///tmp/dbcert2.csv", header=True)

In [0]:
%sh 
ls /tmp/dbcert2.csv

In [0]:
%sh
cat /tmp/dbcert2.csv/part-00000-tid-3512863244519527850-156d845d-dab8-450c-97a2-fcad938d0abf-29-1-c000.csv
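
Spark writes a directory rather than a single file, and the `part-...` file names change on every run, so the name above will not match on another cluster. A sketch for locating the part files without hardcoding them; `list_part_files` is a hypothetical helper and assumes the output sits on the driver's local filesystem.

```python
from pathlib import Path


def list_part_files(output_dir, suffix=""):
    """Return the sorted part files Spark wrote under `output_dir`.

    Marker files such as _SUCCESS are skipped because they do not match part-*.
    """
    return sorted(Path(output_dir).glob(f"part-*{suffix}"))
```

For the output above this could be called as `list_part_files("/tmp/dbcert2.csv", ".csv")`.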

#### WAY 2 - USING FORMAT

In [0]:
df_write.write.format("csv").option("sep", ";").option("header", True).save("file:///tmp/dbcert.csv3")

In [0]:
%sh
ls /tmp/dbcert.csv3

#### WAY 3 - USING MODE

When you use `.write`, you can specify the `mode` to control how the write operation behaves. Here are the most common values:

* `append` → Adds the new data to the existing files or table.
* `overwrite` → Replaces the existing data at the path or in the table.
* `ignore` → Does nothing if data already exists at the path or in the table (it skips writing).
* `error` or `errorifexists` → Throws an error if data already exists (this is the default mode).
* `default` → Alias for `error`.
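
To make the difference between the modes concrete, here is a toy model in plain Python, not Spark code. `existing` stands for the rows already at the target (or `None` if nothing is there), and `apply_mode` is a hypothetical function that returns what the target would hold afterwards.

```python
def apply_mode(existing, new_rows, mode="errorifexists"):
    """Toy model of save modes over plain lists (illustration only, not Spark)."""
    if existing is None:                 # nothing at the target: every mode writes
        return list(new_rows)
    if mode == "append":
        return list(existing) + list(new_rows)
    if mode == "overwrite":
        return list(new_rows)
    if mode == "ignore":
        return list(existing)            # keep what is there, skip the write
    if mode in ("error", "errorifexists"):
        # Spark raises its own exception type; FileExistsError is a stand-in here.
        raise FileExistsError("target already exists")
    raise ValueError(f"unknown mode: {mode}")


print(apply_mode(["a"], ["b"], "append"))     # ['a', 'b']
print(apply_mode(["a"], ["b"], "overwrite"))  # ['b']
print(apply_mode(["a"], ["b"], "ignore"))     # ['a']
```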

In [0]:
df_write.write.mode("append").format("csv").option("sep", ";").option("header", True).save("file:///tmp/dbcert.csv2")

In [0]:
%sh
ls /tmp/dbcert.csv2

#### WAY 4 - AS TABLE


![](https://i.postimg.cc/MTmpnW7x/dbo.png)

**Needs Hive to work if you are trying it locally**

In [0]:
df_write.write.mode("overwrite").saveAsTable("default.devcert")

In [0]:
%sql
DESCRIBE default.devcert;

In [0]:
%sql
SELECT * FROM default.devcert

In [0]:
dft = spark.sql("SELECT * FROM default.devcert")
dft.display()
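
As an alternative to `spark.sql`, a saved table can also be loaded by name with `spark.table`. A minimal sketch; `load_table` is a hypothetical wrapper and `spark` is assumed to be the active session.

```python
def load_table(spark, table_name, columns=None):
    """Load a catalog table by name, optionally keeping only some columns."""
    df = spark.table(table_name)          # same rows as SELECT * FROM table_name
    return df.select(*columns) if columns else df
```

For example, `load_table(spark, "default.devcert", ["first_name"]).display()` would show a single column of the table saved above.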


#### WAY 5 - PARAMS AS DICT

In [0]:
params = {
    "sep": ";",
    "header": True  # inferSchema is a read option, so it is left out when writing
}
df_write.write.mode("append").format("csv").options(**params).save("file:///tmp/dbcert.csv2")

### ADVANCED FORMATS

#### PARQUET

In [0]:
df_write.write.parquet("file:///tmp/dbcert.parquet")

In [0]:
%sh
ls -l /tmp/dbcert.parquet

In [0]:
%sh cat /tmp/dbcert.parquet/part-00000-tid-4268164983331311051-e6267233-c900-41a4-9307-6f86526cf547-41-1-c000.snappy.parquet

#### AVRO

In [0]:
df_write.write.format("avro").save("file:///tmp/dbcert.avro")

In [0]:
%sh
ls -l /tmp/dbcert.avro

In [0]:
%sh
cat  /tmp/dbcert.avro/part-00000-tid-951832466005025034-4a30fe45-7ef9-4eb0-8d7e-bb3df62ec72d-42-1-c000.snappy.avro

#### ORC

In [0]:
df_write.write.format("orc").save("file:///tmp/dbcert.orc")

In [0]:
%sh
ls -l /tmp/dbcert.orc

In [0]:
%sh
cat /tmp/dbcert.orc/part-00000-tid-2888970017771402040-5995028d-3052-4819-a6c6-c6608ccbd653-43-1-c000.snappy.orc
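
Parquet, Avro, and ORC are binary formats, so `cat` mostly prints unreadable bytes. Each one does start with a short magic marker, which the sketch below checks. `sniff_format` is a hypothetical helper and only inspects the first bytes of a local file.

```python
# Leading magic bytes of each binary format.
MAGIC_PREFIXES = {
    "parquet": b"PAR1",
    "avro": b"Obj\x01",
    "orc": b"ORC",
}


def sniff_format(path):
    """Guess a file's format from its first bytes; return None if unrecognized."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    for name, prefix in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return name
    return None
```

This is only a quick sanity check; reading the file back with `spark.read` remains the reliable way to inspect its schema and rows.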

#### JSON

In [0]:
df_write.write.format("json").save("file:///tmp/dbcert.json")

In [0]:
%sh
ls -l /tmp/dbcert.json

In [0]:
%sh
cat /tmp/dbcert.json/part-00000-tid-8467736424276857177-522a1c79-8d22-43be-9263-d463ee53556e-44-1-c000.json
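
Spark's JSON writer emits JSON Lines: one JSON object per line rather than a single array. A sketch for parsing one such part file in plain Python; `read_json_lines` is a hypothetical helper and assumes a local UTF-8 file.

```python
import json


def read_json_lines(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Each returned dict corresponds to one row of the written DataFrame.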

## USING SOME OF THE CONCEPTS

Formats such as Parquet, Avro, and ORC store the schema inside the file itself, so when Spark reads them it gets both the structure (columns, types) and the data, with no need to define a schema manually. CSV and text files carry no such metadata, so the pipeline in this section, which reads the CSV file again, has to infer or declare the schema.




#### STAGE 1 - READ DATA

In [0]:
from pyspark.sql.functions import explode, col, split, trim, lower, count
params = {
    "sep": ",",
    "header": True,
    "inferSchema": True
}

# get data
stg_one = spark.read.format("csv").options(**params).load(spark_uri)\
    .select(
        "first_name",
        "last_name",
        explode(split(col("email_dependents"), ",")).alias("user_email")
    )

stg_one.display()

#### STAGE 2 - CLEAN DATA

In [0]:
stg_two = stg_one.select(
    "first_name",
    "last_name",
    lower(trim(col("user_email"))).alias("user_email_cleaned")
)
stg_two.display()


#### STAGE 3 - AGGREGATION

In [0]:
stg_three = stg_two.groupBy("user_email_cleaned").agg(
    count("*").alias("email_count")
).orderBy("email_count", ascending=False).limit(3)
stg_three.display()
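
For intuition, the three stages can be mirrored in plain Python on a tiny sample. The rows below are made up for illustration (they are not taken from the dataset), and `top_emails` is a hypothetical function.

```python
from collections import Counter


def top_emails(rows, limit=3):
    """Mirror the pipeline: split and explode, trim and lowercase, count, keep the top."""
    cleaned = [
        email.strip().lower()                   # stage 2: trim + lower
        for _first, _last, emails in rows
        for email in emails.split(",")          # stage 1: split + explode
    ]
    return Counter(cleaned).most_common(limit)  # stage 3: count, order, limit


# Made-up sample rows: (first_name, last_name, email_dependents)
sample = [
    ("Ana", "Diaz", "a@x.com, B@x.com"),
    ("Luis", "Mora", "a@x.com,c@x.com"),
    ("Eva", "Ruiz", " A@X.com ,b@x.com, a@x.com"),
]

print(top_emails(sample))
# [('a@x.com', 4), ('b@x.com', 2), ('c@x.com', 1)]
```

The sample shows why the cleaning stage matters: without trimming and lowercasing, `" A@X.com "` and `"a@x.com"` would be counted as different addresses.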

#### STAGE 4 - CHARTS

In [0]:
display(stg_three)
