<p><img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/1e/UNAL_Logosimbolo.svg/583px-UNAL_Logosimbolo.svg.png" alt="" width="1280" height="300" /></p>


# SELECT COLUMNS IN SPARK DATAFRAME 

In Spark, there are several ways to select data from a DataFrame. Some resemble the familiar pandas syntax, while others are specific to Spark's DataFrame API, which is designed for distributed data.

In [0]:
elements = [
    {"id": 1, "name": "July", "age": 34, "salary": 550, "role": "admin"},
    {"id": 2, "name": "Gabriel", "age": 29, "salary": 720, "role": "developer"},
    {"id": 3, "name": "Luis", "age": 42, "salary": 610, "role": "developer"},
    {"id": 4, "name": "John", "age": 51, "salary": 890, "role": "manager"},
    {"id": 5, "name": "Daniel", "age": 27, "salary": 480, "role": "developer"},
    {"id": 6, "name": "Mary", "age": 38, "salary": 700, "role": "admin"},
    {"id": 7, "name": "Monica", "age": 33, "salary": 460, "role": "tester"},
    {"id": 8, "name": "Andrea", "age": 45, "salary": 680, "role": "admin"},
    {"id": 9, "name": "Sebastian", "age": 31, "salary": 530, "role": "developer"},
    {"id": 10, "name": "Johana", "age": 26, "salary": 410, "role": "tester"}
]

df = spark.createDataFrame(elements)
df.display()

## SIMPLE

### TEXT

#### ALL

In [0]:
df.select("*").display()

#### SPECIFIC COLUMNS

In [0]:
df.select("id", "age").display()

### LIST

In [0]:
df.select(["id", "age"]).display()

### ALIAS

In [0]:
df.alias("trn").select("trn.*").display()

### SELECTEXPR

Only single-line transformations are possible: each argument is a single SQL expression.

In [0]:
df.selectExpr("age AS edad", "id AS identificador", "CAST(id AS FLOAT) idx", "True as active").show()

## REFERENCING

### PANDAS TYPE

In [0]:
df.select(df["role"], df["role"].isin("admin").alias("is_admin")).display()

### ATTRIBUTE

In [0]:
df.select(df.role, df.role.isin("admin").alias("is_admin")).display()

## COL

To do this, use the `col` function from the `pyspark.sql.functions` module.

In [0]:
from pyspark.sql.functions import col

### SIMPLE

In [0]:
df.select(col("age")).display()

### MIX

In [0]:
df.select(col("age"), "salary").display()

### ALIAS

In [0]:
df.select(col("age").alias("edad")).display()

### ARITHMETIC OPERATIONS

In [0]:
df.select(
    col("salary") + 1000,          
    col("salary") - 500,           
    col("age") * 2,                
    col("salary") / 2,             
    col("salary") % 3              
).display()

### CASTING

In [0]:
df.select(
    col("age").cast("string"), 
    col("salary").cast("double")
).display()

### NULLS



#### IS NULL

In [0]:
df.select(col("name").isNull()).display()

#### IS NOT NULL

In [0]:
df.select(col("name").isNotNull()).display()

### SORT

In [0]:
df.display()

#### DESC

In [0]:
df.sort(col("age").desc()).display()

#### ASC

In [0]:
df.sort(col("age").asc()).display()

### BETWEEN

In [0]:
df.select(col("age").between(1,10)).display()

In [0]:
df.select(col("age").between(25, 35)).display()

### DTYPES


In [0]:
df.select("*").dtypes

### COMMON TEXT
Methods that return booleans should be used in filters.

#### STARTSWITH

In [0]:
df.select(col("name").startswith("J")).display()

#### ENDSWITH

In [0]:
df.select(col("name").endswith("a")).display()

#### CONTAINS

In [0]:
df.select(col("name").contains("an")).display()

#### IS IN

In [0]:
df.select(col("name").isin("July", "John", "Mary")).display()

#### LIKE

In [0]:
df.select(col("name").like("J%")).display()

#### RLIKE

In [0]:
df.select(col("name").rlike("J.*y")).display()

#### SUB STR

In [0]:
df.select(col("name").substr(1, 3)).display()

## FILTERING

To filter rows, use `filter` or `where`; `where` is an alias of `filter`, so both behave identically. The usual comparison operators are available:

* `==` equal to
* `!=` not equal to
* `>` greater than
* `<` less than
* `>=` greater than or equal to
* `<=` less than or equal to

### AS COLUMN

In [0]:
df.filter(col("age") > 30).display()

In [0]:
df.where(col("age") > 30).display()

### ANY COLUMN FUNCTION

In [0]:
df.filter(col("role").isin("admin", "tester")).display()

### AND

The `&` symbol is used and each condition must be enclosed in parentheses.

In [0]:
df.filter((col("age") > 30) & (col("salary") > 500)).display()
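
The parentheses are needed because Python's `&` binds more tightly than comparison operators such as `>`. The pure-Python sketch below only parses the two spellings with the standard `ast` module (it does not run Spark) to show how differently they are grouped; `top_level_node` is a hypothetical helper introduced for this illustration.

```python
import ast


def top_level_node(expression: str) -> str:
    """Return the name of the outermost AST node Python builds for the expression."""
    return type(ast.parse(expression, mode="eval").body).__name__


# Without parentheses Python reads: col("age") > (30 & col("salary")) > 500,
# that is, a chained comparison rather than an AND of two conditions.
without_parens = top_level_node('col("age") > 30 & col("salary") > 500')

# With parentheses the outermost operation is the `&` between two comparisons.
with_parens = top_level_node('(col("age") > 30) & (col("salary") > 500)')
```

Here `without_parens` is `"Compare"` and `with_parens` is `"BinOp"`. A chained comparison forces Python to convert a Column to a plain boolean, which PySpark typically rejects with an error, so the unparenthesized form does not express the intended filter.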

### OR
The `|` symbol is used and each condition must be enclosed in parentheses.


In [0]:
df.filter((col("role") == "admin") | (col("role") == "developer")).display()

### NOT

The `~` symbol is used and the condition being negated must be enclosed in parentheses.

In [0]:
df.filter(~(col("role") == "developer")).display()

### AS TEXT

In [0]:
df.filter("salary between 500 AND 700").display()

### MONOTONICALLY INCREASING ID
Generates a unique ID for each row. The IDs increase monotonically but are not guaranteed to be sequential or contiguous.

In [0]:
from pyspark.sql.functions import monotonically_increasing_id
df.select(monotonically_increasing_id()).display()
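
The gaps come from how the IDs are built. The Spark documentation describes the current implementation as placing the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The function below is a hypothetical pure-Python imitation of that layout, not Spark code, and assumes two partitions with three rows each.

```python
def sketch_monotonic_id(partition_index: int, row_in_partition: int) -> int:
    """Imitate the documented layout: partition ID in the upper bits,
    record number within the partition in the lower 33 bits."""
    return (partition_index << 33) + row_in_partition


# Assumed layout for illustration: 2 partitions, 3 rows in each.
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]
```

Under these assumptions the IDs are 0, 1, 2 for the first partition and then jump to 8589934592 (2 to the power 33) for the second, which is why the values are unique and increasing but not contiguous.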

### ROW NUMBER

In [0]:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.orderBy("name")  # or any other column
df_with_key = df.withColumn("surrogate_key", row_number().over(window_spec))
df_with_key.display()

## AS SQL 

### createOrReplaceTempView
* Creates a temporary view local to the current Spark Session.
* It is only available in the active Spark session where it was created.
* It is not shared across sessions or notebooks.

In [0]:
df.createOrReplaceTempView("training")

In [0]:
display(spark.sql("select * from training"))

In [0]:
%sql
select * from training

### createOrReplaceGlobalTempView

* Creates a global temporary view, registered in a special database called `global_temp`.

* It is available to all active Spark sessions in the same application (for example, from different notebooks on the same cluster).

In [0]:
df.createOrReplaceGlobalTempView("training")