In [14]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create a Spark session
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Create a list of tuples, one tuple per row
tuples_list = [
    (1, "Alice", 29),
    (2, "Bob", 35),
    (3, "Charlie", 40)
]

# Define an explicit schema using StructType and StructField
schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

# Create a DataFrame from the list of tuples
df = spark.createDataFrame(tuples_list, schema=schema)

# Print the schema
df.printSchema()

# Print the DataFrame
df.show()

root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)

+---+-------+---+
| ID|   Name|Age|
+---+-------+---+
|  1|  Alice| 29|
|  2|    Bob| 35|
|  3|Charlie| 40|
+---+-------+---+



Key Learnings:

Purpose of SparkSession: It is the entry point to Spark functionality such as DataFrames and Spark SQL.

Key parts of creating a SparkSession:
1. builder -> Starts building the SparkSession.
2. appName("example") -> Assigns a name to your Spark application.
3. getOrCreate() ->
  Reuse: Returns an existing SparkSession if one exists.

  Create: Builds a new one if none exists.

PySpark has two primary distributed data structures for processing data:

1. RDD (Resilient Distributed Dataset)
2. DataFrame

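RDD:

Schema -> No schema; elements are arbitrary Python objects,

Optimizations -> None (no query optimizer; code runs as written),

Performance -> Slower in PySpark (data is serialized between the JVM and Python workers),

Use Case -> Custom algorithms, raw or unstructured data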

DataFrame:

Schema -> Named, typed columns (explicit or inferred schema),

Optimizations -> Catalyst + Tungsten optimizations,

Performance -> Faster (optimized query plans executed in the JVM),

Use Case -> SQL analytics, structured pipelines


Schema:
There are three common ways to define a schema:
1. Explicit Schema (StructType) : Defined manually using StructType and StructField.
2. Inferred Schema : Spark infers the column types from the data.
3. DDL Schema (SQL-like String) : Defined using a SQL DDL-formatted string.


Data Display Methods:

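1. show() -> Prints the first 20 rows; strings longer than 20 characters are truncated
2. show(n) -> Prints the first n rows
3. show(truncate=False) -> Prints full cell content
4. show(vertical=True) -> Prints each row vertically, one column per line
5. printSchema() -> Prints column names, types, and nullability
6. head() -> Returns the first Row (or None if the DataFrame is empty); head(n) returns a list of the first n Row objects
7. take(n) -> Returns a list of the first n Row objects
8. collect() -> Returns all rows to the driver as a list of Row objects; avoid on large DataFrames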