## Directed Acyclic Graph (DAG), Lazy Evaluation and Jobs in Spark

#### A DAG in Spark represents the logical execution plan of a Spark application.
#### It is a graph in which nodes represent RDDs (or DataFrames/Datasets) and the transformations applied to them,
#### while edges represent the flow of data and the dependencies between these operations.
#### Spark builds the DAG lazily: transformations only add to the plan, and nothing runs until an action (such as `show()` or `count()`) triggers a job.


## RDD (Resilient Distributed Dataset) in Spark

#### RDD, or Resilient Distributed Dataset, is the fundamental data structure in Apache Spark. 
#### It represents an immutable, fault-tolerant, and distributed collection of elements that can be processed in parallel across a cluster of machines. 

## Parquet File

#### Columnar Storage: Unlike row-based formats (like CSV or JSON), Parquet stores data by column. This layout allows for:
#### Efficient Querying: When a query needs only specific columns, only those columns are read, which reduces I/O and improves query performance.
#### Better Compression: Values of the same type are stored together in each column, so compression works more effectively and storage space is reduced.


## Writing DataFrames in Spark 

In [14]:
from pyspark.sql import SparkSession

In [21]:
spark = (
    SparkSession.builder
        .appName("Partitioning and Bucketing")
        .master("local[*]")
        .getOrCreate()
)

25/11/14 23:32:01 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


In [29]:
# Read the CSV with a header row and let Spark infer column types
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("data_2.csv")

In [31]:
df.show()

+---+--------+---+------+-------+------+
| id|    name|age|salary|address|gender|
+---+--------+---+------+-------+------+
|  1|  Manish| 26| 75000|  INDIA|     m|
|  2|  Nikita| 23|100000|    USA|     f|
|  3|  Pritam| 22|150000|  INDIA|     m|
|  4|Prantosh| 17|200000|  JAPAN|     m|
|  5|  Vikash| 31|300000|    USA|     m|
|  6|   Rahul| 55|300000|  INDIA|     m|
|  7|    Raju| 67|540000|    USA|     m|
|  8| Praveen| 28| 70000|  JAPAN|     m|
|  9|     Dev| 32|150000|  JAPAN|     m|
| 10|  Sherin| 16| 25000| RUSSIA|     f|
| 11|    Ragu| 12| 35000|  INDIA|     f|
| 12|   Sweta| 43|200000|  INDIA|     f|
| 13| Raushan| 48|650000|    USA|     m|
| 14|  Mukesh| 36| 95000| RUSSIA|     m|
| 15| Prakash| 52|750000|  INDIA|     m|
+---+--------+---+------+-------+------+

