<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>

# CREATE SPARK SCHEMA

Define a schema explicitly with `StructType` and `StructField` in Spark when:

- You need explicit control over column names and data types, especially when reading text-based formats such as CSV or JSON, which do not carry type information.
- You want to prevent type-inference errors and make sure data is read consistently.
- You want to avoid the cost of schema inference, which requires Spark to scan the data before reading it.
- You need to validate data structure in pipelines where data quality matters.

However, when working with binary columnar formats like `Parquet`, it's often not required, because these formats store their schema internally and Spark can read it automatically.

In [0]:
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

## TEXT

In [0]:
schema = """
id INT,
name STRING
"""

## FIELD TYPE

### DEFINITION
`StructField` describes a single column of a schema. Its parameters are:

| Parameter | Description |
|:----------|:------------|
| **name** | Name of the column. |
| **dataType** | Data type (`StringType`, `IntegerType`, etc.). |
| **nullable** | `True` or `False` (whether the column can have null values). |
| **metadata** | Optional dictionary to store additional information. |

Note: `metadata` accepts any key-value pairs you want to attach to the field.

Syntax:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Signature: only name and dataType are required.
StructField(name, dataType, nullable=True, metadata=None)
```

In [0]:


field_schema = StructField(
    name="user_name",
    nullable=False,
    dataType=StringType(),
    metadata={"description": "user name info", "source": "training"}
)


### MAIN OPTIONS

#### SIMPLE

In [0]:
print(field_schema)

#### DATATYPE

In [0]:
field_schema.dataType

#### METADATA

In [0]:
field_schema.metadata

#### NAME


In [0]:
field_schema.name

#### JSON

In [0]:
field_schema.json()

#### SIMPLE STRING


In [0]:
field_schema.simpleString()

#### TYPE NAME

In [0]:
# typeName() is not available on StructField itself; call it on the field's data type.
field_schema.dataType.typeName()

## STRUCT TYPE


### DEFINITION


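`StructType` describes a complete schema as an ordered collection of `StructField` objects. Its parameters are: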

| Parameter | Description |
|:----------|:------------|
| **fields** | A list of `StructField` objects that define the schema (columns, their data types, and properties). It is optional; you can start with `None` and add fields later. |

Syntax:

```python

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField(...)
])

```



In [0]:
schema = StructType(
    [
        StructField('user_name', StringType(), False),
        StructField('age', IntegerType(), False)
    ]
)

### MAIN OPTIONS

#### SIMPLE

In [0]:
print(schema)

#### TEXT

In [0]:
print(schema.simpleString())

#### TREE STRING

In [0]:
print(schema.treeString())

#### JSON

In [0]:
print(schema.json())

#### FIELDS

In [0]:
print(schema.fields)

#### NAMES

In [0]:
print(schema.names)

#### FROM DDL

In [0]:
# fromDDL is a class method, so it does not depend on the existing schema object.
print(StructType.fromDDL('a INT, b STRING'))

#### TYPE NAME

In [0]:
# typeName is a method; without the parentheses this prints the method object.
print(schema.typeName())

## ROW

`Row` is a container for a single record, similar to a Python named tuple.

Each `Row` represents one row of data in a DataFrame.

When a DataFrame is brought back to the driver (for example with `collect()`), its records are returned as `Row` objects.

In [0]:
from pyspark.sql import Row


### DEFINITION



In [0]:
row = Row(name="training", tags=[1,2,3])
row


### MAIN OPTIONS

#### GET ATTRIBUTES

In [0]:
row.name

In [0]:
row.tags

#### ASDICT

In [0]:
row.asDict()

#### COUNT

In [0]:
row.count("training")

#### INDEX

In [0]:
row.index("training")

#### ROW EQUALITY



In [0]:
Row(name="training", tags=[1,2,3]) == Row(name="training", tags=[1,2,3])


#### CONVERT TUPLES TO ROW

In [0]:
elements = [
    (1, 'user_1'),
    (2, 'user_2'),
    (3, 'user_3')
]


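# Row("id", "name") creates a reusable Row class with named fields
# (the field names are illustrative); calling it with values fills them in order.
UserRow = Row("id", "name")
elements = [UserRow(*element) for element in elements]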
print(elements)

#### CONVERT DICT TO ROW

In [0]:
elements = [
    {
        "users": "user_a",
        "ages": 1,
        "name": "test"
    },
    {
        "users": "user_b",
        "ages": 2,
        "name": "test_2"
    },
]

elements = [Row(**element) for element in elements]
print(elements)