# DAY 4: Spark Data Frame 
- Youtube Link: https://www.youtube.com/watch?v=02lSlhwLU4c

### What is a Spark Data Frame?
- A Spark DataFrame is like a distributed, in-memory table with named columns and a schema.
- The schema defines the column names and the data type of each column.
- Inspired by pandas DataFrames!

### Creating our first data frame (with a schema)

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("Jack", "", "Eldridge", "12345", "M", 90000),
        ("Elliot", "Jordan", "Monro", "12346", "M", 47000),
        ("Sandra", "Faye", "Roberts", "12347", "F", 95000),
        ("Anne", "Oway", "Jones", "12348", "F", 78000)]

schema = StructType([
  StructField("firstname", StringType(), True),
  StructField("middlename", StringType(), True),
  StructField("lastname", StringType(), True),
  StructField("id", StringType(), True),
  StructField("gender", StringType(), True),
  StructField("salary", IntegerType(), True)
])

In [None]:
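# Assumes `spark` is an existing SparkSession (for example, one the notebook environment provides)
df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()          # column names, types, and nullability
df.show(truncate=False)   # print the rows without shortening long values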

In [None]:
type(df)

### Why Schemas?
- Schemas are commonly defined up front, especially when reading from an external data source (including files).
- Spark doesn't have to infer the data types, which can be expensive in time and processing power (and therefore money), since inference typically means an extra read of the data.
- You can detect errors early if the data doesn't match the schema.

### Defining a Schema using Data Definition Language (DDL)

In [None]:
schema_ddl = "firstname STRING, middlename STRING, lastname STRING, id STRING, gender STRING, salary INT"
df_with_ddl_schema = spark.createDataFrame(data=data2, schema=schema_ddl)
df_with_ddl_schema.printSchema()