<a href="https://colab.research.google.com/github/JoaoPauloSarzedasRibeiro/data_analysis_with_Python/blob/main/Spark_Opera%C3%A7%C3%B5es_b%C3%A1sicas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Installing PySpark on Google Colab

In [2]:
# install the dependencies required by Spark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
!tar xf spark-2.4.4-bin-hadoop2.7.tgz
!pip install -q findspark

In [3]:
# set the environment variables
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.4-bin-hadoop2.7"

# make pyspark importable
# (the absolute SPARK_HOME path is used so the call does not depend on the current working directory)
import findspark
findspark.init(os.environ["SPARK_HOME"])

In [4]:
# start a local session named spark
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local[*]').getOrCreate()

In [5]:
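# Importing the functions library
# Note: the wildcard import shadows Python built-ins such as round, sum, min and max with their Spark versions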
from pyspark.sql.functions import *

#Reading the data and creating a DataFrame

In [6]:
#Registering the CSV file hosted on GitHub so that Spark can access it
url = 'https://raw.githubusercontent.com/JoaoPauloSarzedasRibeiro/data_analysis_with_Python/main/data/Salaries.csv'
from pyspark import SparkFiles
spark.sparkContext.addFile(url)

In [12]:
#Reading the CSV data
df_csv = (
    spark
    .read
    .format('csv')
    .options(header=True, inferSchema=True,sep=',',encoding='latin1')
    .load(SparkFiles.get('Salaries.csv'))
    )

In [13]:
#Saving the data in Parquet format for manipulation with Spark
df_csv.write.format('parquet').mode('overwrite').save('df')

In [14]:
#Creating a new DataFrame from the data now stored in Parquet format
df = spark.read.format('parquet').load('df')

#Columns and Expressions

In [17]:
#Checking the DataFrame schema
df.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- EmployeeName: string (nullable = true)
 |-- JobTitle: string (nullable = true)
 |-- BasePay: double (nullable = true)
 |-- OvertimePay: double (nullable = true)
 |-- OtherPay: double (nullable = true)
 |-- Benefits: double (nullable = true)
 |-- TotalPay: double (nullable = true)
 |-- TotalPayBenefits: double (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Notes: string (nullable = true)
 |-- Agency: string (nullable = true)
 |-- Status: double (nullable = true)



In [20]:
#Creating a column from another one
(
    df.select('EmployeeName','JobTitle','BasePay','TotalPay', )
    .withColumn("BasePay_day", round(col('BasePay')/365,2))
    .show(5)
)

+-----------------+--------------------+---------+---------+-----------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|BasePay_day|
+-----------------+--------------------+---------+---------+-----------+
|   NATHANIEL FORD|GENERAL MANAGER-M...|167411.18|567595.43|     458.66|
|     GARY JIMENEZ|CAPTAIN III (POLI...|155966.02|538909.28|      427.3|
|   ALBERT PARDINI|CAPTAIN III (POLI...|212739.13|335279.91|     582.85|
|CHRISTOPHER CHONG|WIRE ROPE CABLE M...|  77916.0|332343.61|     213.47|
|  PATRICK GARDNER|DEPUTY CHIEF OF D...| 134401.6|326373.19|     368.22|
+-----------------+--------------------+---------+---------+-----------+
only showing top 5 rows



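**The "pandas" way of selecting columns:**



1.   df['column']
2.   df.column

This is not the best option for chained processing: df['column'] is resolved against the DataFrame bound to df, so a column created earlier in the same chain is not visible through it. To run two operations on df['column'] in the same block, the intermediate result must first be saved as a new DataFrame.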


In [21]:
#Recreating the previous step using the 'pandas' way of selecting the column
(
    df.select('EmployeeName','JobTitle','BasePay','TotalPay', )
    .withColumn("BasePay_day", round(df['BasePay']/365,2))
    .show(5)
)

+-----------------+--------------------+---------+---------+-----------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|BasePay_day|
+-----------------+--------------------+---------+---------+-----------+
|   NATHANIEL FORD|GENERAL MANAGER-M...|167411.18|567595.43|     458.66|
|     GARY JIMENEZ|CAPTAIN III (POLI...|155966.02|538909.28|      427.3|
|   ALBERT PARDINI|CAPTAIN III (POLI...|212739.13|335279.91|     582.85|
|CHRISTOPHER CHONG|WIRE ROPE CABLE M...|  77916.0|332343.61|     213.47|
|  PATRICK GARDNER|DEPUTY CHIEF OF D...| 134401.6|326373.19|     368.22|
+-----------------+--------------------+---------+---------+-----------+
only showing top 5 rows



In [23]:
#The structure below raises an error because the column created at runtime is not recognized: df['BasePay_day'] is looked up in the original df, which has no such column. It would only be recognized if we saved a new DataFrame containing the column created at runtime.
(
    df.select('EmployeeName','JobTitle','BasePay','TotalPay', )
    .withColumn("BasePay_day", round(df['BasePay']/365,2))
    .withColumn("BasePay_day_plus2", df['BasePay_day']+2)
    .show(5)
)

AnalysisException: ignored

In [25]:
#Recreating the previous step using col(). This time the error does not occur.
(
    df.select('EmployeeName','JobTitle','BasePay','TotalPay', )
    .withColumn("BasePay_day", round(col('BasePay')/365,2))
    .withColumn("BasePay_day_plus2", col('BasePay_day')+2)
    .show(5)
)

+-----------------+--------------------+---------+---------+-----------+-----------------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|BasePay_day|BasePay_day_plus2|
+-----------------+--------------------+---------+---------+-----------+-----------------+
|   NATHANIEL FORD|GENERAL MANAGER-M...|167411.18|567595.43|     458.66|           460.66|
|     GARY JIMENEZ|CAPTAIN III (POLI...|155966.02|538909.28|      427.3|            429.3|
|   ALBERT PARDINI|CAPTAIN III (POLI...|212739.13|335279.91|     582.85|           584.85|
|CHRISTOPHER CHONG|WIRE ROPE CABLE M...|  77916.0|332343.61|     213.47|           215.47|
|  PATRICK GARDNER|DEPUTY CHIEF OF D...| 134401.6|326373.19|     368.22|           370.22|
+-----------------+--------------------+---------+---------+-----------+-----------------+
only showing top 5 rows



**Expressions**

It is up to the programmer whether to use the column syntax shown above or the expression syntax. Spark converts the column syntax into expressions, so the two end up being equivalent.

In [26]:
#Rewriting the earlier column-syntax operation using the expression syntax
(
    df.select('EmployeeName','JobTitle','BasePay','TotalPay', )
    .withColumn("BasePay_day", expr('round(BasePay/365,2)'))
    .show(5)
)

+-----------------+--------------------+---------+---------+-----------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|BasePay_day|
+-----------------+--------------------+---------+---------+-----------+
|   NATHANIEL FORD|GENERAL MANAGER-M...|167411.18|567595.43|     458.66|
|     GARY JIMENEZ|CAPTAIN III (POLI...|155966.02|538909.28|      427.3|
|   ALBERT PARDINI|CAPTAIN III (POLI...|212739.13|335279.91|     582.85|
|CHRISTOPHER CHONG|WIRE ROPE CABLE M...|  77916.0|332343.61|     213.47|
|  PATRICK GARDNER|DEPUTY CHIEF OF D...| 134401.6|326373.19|     368.22|
+-----------------+--------------------+---------+---------+-----------+
only showing top 5 rows



#Selecting Columns and Filtering Rows

In [31]:
#Checking which columns exist in the DataFrame
df.columns

['Id',
 'EmployeeName',
 'JobTitle',
 'BasePay',
 'OvertimePay',
 'OtherPay',
 'Benefits',
 'TotalPay',
 'TotalPayBenefits',
 'Year',
 'Notes',
 'Agency',
 'Status']

In [30]:
#Simple column selection, using the select method and the column names
df.select('EmployeeName','JobTitle','BasePay','TotalPay').show(5)

+-----------------+--------------------+---------+---------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|
+-----------------+--------------------+---------+---------+
|   NATHANIEL FORD|GENERAL MANAGER-M...|167411.18|567595.43|
|     GARY JIMENEZ|CAPTAIN III (POLI...|155966.02|538909.28|
|   ALBERT PARDINI|CAPTAIN III (POLI...|212739.13|335279.91|
|CHRISTOPHER CHONG|WIRE ROPE CABLE M...|  77916.0|332343.61|
|  PATRICK GARDNER|DEPUTY CHIEF OF D...| 134401.6|326373.19|
+-----------------+--------------------+---------+---------+
only showing top 5 rows



In [36]:
#Selecting with a list of the columns whose names contain the string 'Pay'
select_cols = [c for c in df.columns if c.find('Pay') != -1]

df.select(select_cols).show(5)

+---------+-----------+---------+---------+----------------+
|  BasePay|OvertimePay| OtherPay| TotalPay|TotalPayBenefits|
+---------+-----------+---------+---------+----------------+
|167411.18|        0.0|400184.25|567595.43|       567595.43|
|155966.02|  245131.88|137811.38|538909.28|       538909.28|
|212739.13|  106088.18|  16452.6|335279.91|       335279.91|
|  77916.0|   56120.71| 198306.9|332343.61|       332343.61|
| 134401.6|     9737.0|182234.59|326373.19|       326373.19|
+---------+-----------+---------+---------+----------------+
only showing top 5 rows
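The name filter above is ordinary Python, so it can be checked without Spark. The sketch below copies the column names from the `df.columns` output and compares `find()` with the more idiomatic `in` operator. The `endswith` variant is an assumption about what a reader might want instead; it is not part of the original notebook.

```python
# Column names copied from the df.columns output shown earlier.
columns = ['Id', 'EmployeeName', 'JobTitle', 'BasePay', 'OvertimePay',
           'OtherPay', 'Benefits', 'TotalPay', 'TotalPayBenefits',
           'Year', 'Notes', 'Agency', 'Status']

# The notebook's version: find() returns -1 when the substring is absent.
by_find = [c for c in columns if c.find('Pay') != -1]

# Equivalent and more idiomatic: the `in` operator tests substring membership.
by_in = [c for c in columns if 'Pay' in c]

# A stricter variant (illustrative): only names that end with 'Pay'.
by_suffix = [c for c in columns if c.endswith('Pay')]

print(by_find)
print(by_in == by_find)
print(by_suffix)
```

The first two lists are identical, while the suffix variant leaves out `TotalPayBenefits`.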



In [37]:
# Combining the two ways of selecting columns
# The '*' before select_cols unpacks the list, so each of its items is passed as a separate positional argument
df.select('EmployeeName', *select_cols).show(5)

+-----------------+---------+-----------+---------+---------+----------------+
|     EmployeeName|  BasePay|OvertimePay| OtherPay| TotalPay|TotalPayBenefits|
+-----------------+---------+-----------+---------+---------+----------------+
|   NATHANIEL FORD|167411.18|        0.0|400184.25|567595.43|       567595.43|
|     GARY JIMENEZ|155966.02|  245131.88|137811.38|538909.28|       538909.28|
|   ALBERT PARDINI|212739.13|  106088.18|  16452.6|335279.91|       335279.91|
|CHRISTOPHER CHONG|  77916.0|   56120.71| 198306.9|332343.61|       332343.61|
|  PATRICK GARDNER| 134401.6|     9737.0|182234.59|326373.19|       326373.19|
+-----------------+---------+-----------+---------+---------+----------------+
only showing top 5 rows
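The effect of `*` can be seen in plain Python. In this minimal sketch, `select` is a hypothetical stand-in that only returns the arguments it receives. It is not Spark's `DataFrame.select`, and it exists only to show what the call site passes along.

```python
def select(*cols):
    """Hypothetical stand-in for a select-like method: returns its arguments as received."""
    return cols

select_cols = ['BasePay', 'OvertimePay']

# With '*', the list items arrive as individual positional arguments.
unpacked = select('EmployeeName', *select_cols)

# Without '*', the whole list arrives as a single argument.
nested = select('EmployeeName', select_cols)

print(unpacked)
print(nested)
```

With unpacking the function receives three strings. Without it, it receives one string and one list.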



In [39]:
# Selecting all columns at once using *

df.select('*').show(5)

+---+-----------------+--------------------+---------+-----------+---------+--------+---------+----------------+----+-----+-------------+------+
| Id|     EmployeeName|            JobTitle|  BasePay|OvertimePay| OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+---+-----------------+--------------------+---------+-----------+---------+--------+---------+----------------+----+-----+-------------+------+
|  1|   NATHANIEL FORD|GENERAL MANAGER-M...|167411.18|        0.0|400184.25|    null|567595.43|       567595.43|2011| null|San Francisco|  null|
|  2|     GARY JIMENEZ|CAPTAIN III (POLI...|155966.02|  245131.88|137811.38|    null|538909.28|       538909.28|2011| null|San Francisco|  null|
|  3|   ALBERT PARDINI|CAPTAIN III (POLI...|212739.13|  106088.18|  16452.6|    null|335279.91|       335279.91|2011| null|San Francisco|  null|
|  4|CHRISTOPHER CHONG|WIRE ROPE CABLE M...|  77916.0|   56120.71| 198306.9|    null|332343.61|       332343.61|2011| null|San Fra

**Performing operations inside select**

In [56]:
df.select('EmployeeName', lower(col('JobTitle')).alias('JobTitle'), 'BasePay', 'TotalPay').show(5)

+-----------------+--------------------+---------+---------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|
+-----------------+--------------------+---------+---------+
|   NATHANIEL FORD|general manager-m...|167411.18|567595.43|
|     GARY JIMENEZ|captain iii (poli...|155966.02|538909.28|
|   ALBERT PARDINI|captain iii (poli...|212739.13|335279.91|
|CHRISTOPHER CHONG|wire rope cable m...|  77916.0|332343.61|
|  PATRICK GARDNER|deputy chief of d...| 134401.6|326373.19|
+-----------------+--------------------+---------+---------+
only showing top 5 rows



In [55]:
#Using the selectExpr method so that everything inside the select is interpreted as an expression
df.selectExpr('EmployeeName', 'lower(JobTitle) as JobTitle', 'BasePay', 'TotalPay').show(5)

+-----------------+--------------------+---------+---------+
|     EmployeeName|            JobTitle|  BasePay| TotalPay|
+-----------------+--------------------+---------+---------+
|   NATHANIEL FORD|general manager-m...|167411.18|567595.43|
|     GARY JIMENEZ|captain iii (poli...|155966.02|538909.28|
|   ALBERT PARDINI|captain iii (poli...|212739.13|335279.91|
|CHRISTOPHER CHONG|wire rope cable m...|  77916.0|332343.61|
|  PATRICK GARDNER|deputy chief of d...| 134401.6|326373.19|
+-----------------+--------------------+---------+---------+
only showing top 5 rows



**Selecting distinct values**

In [57]:
df.select('JobTitle').distinct().show()

+--------------------+
|            JobTitle|
+--------------------+
|MANAGER, UNIFIED ...|
|COURT ALTERNATIVE...|
|Transit Informati...|
|Publ Svc Aide-Ass...|
|       CITY ATTORNEY|
|HEALTH PROGRAM CO...|
|PERFORMANCE ANALY...|
|HEAVY EQUIPMENT O...|
|          Dep Dir IV|
|          Manager VI|
|Communications Li...|
|Custodial Assista...|
|Transit Fare Insp...|
|Public Service Ai...|
|BATTALION CHIEF, ...|
|ASSOCIATE PERFORM...|
|SENIOR ESTATE INV...|
|              ROOFER|
|        Undersheriff|
|        Dept Head II|
+--------------------+
only showing top 20 rows



In [61]:
#Looking for unique rows
df.distinct().show(5)

+---+-------------+--------------------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
| Id| EmployeeName|            JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+---+-------------+--------------------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|181|DONALD BRYANT|ELECTRICAL TRANSI...| 77580.36|   119407.7|11220.57|    null|208208.63|       208208.63|2011| null|San Francisco|  null|
|443|TROY WILLIAMS|       NURSE MANAGER| 161044.0|        0.0| 27377.5|    null| 188421.5|        188421.5|2011| null|San Francisco|  null|
|543|DEBORAH LOGAN|       NURSE MANAGER| 158223.4|        0.0|26661.89|    null|184885.29|       184885.29|2011| null|San Francisco|  null|
|744|     EDDY WOO|         FIREFIGHTER|105934.67|   49733.97| 21851.5|    null|177520.14|       177520.14|2011| null|San Francisco|  null|
|902|SCOTT SANDINE| 

In [62]:
#or removing duplicates from the selection
df.dropDuplicates().show(5)

+---+-------------+--------------------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
| Id| EmployeeName|            JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+---+-------------+--------------------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|181|DONALD BRYANT|ELECTRICAL TRANSI...| 77580.36|   119407.7|11220.57|    null|208208.63|       208208.63|2011| null|San Francisco|  null|
|443|TROY WILLIAMS|       NURSE MANAGER| 161044.0|        0.0| 27377.5|    null| 188421.5|        188421.5|2011| null|San Francisco|  null|
|543|DEBORAH LOGAN|       NURSE MANAGER| 158223.4|        0.0|26661.89|    null|184885.29|       184885.29|2011| null|San Francisco|  null|
|744|     EDDY WOO|         FIREFIGHTER|105934.67|   49733.97| 21851.5|    null|177520.14|       177520.14|2011| null|San Francisco|  null|
|902|SCOTT SANDINE| 

**Filters**

Logical operators:

* and: &
* or: |
* not: ~
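These operators bind more tightly than comparisons such as `==` and `>` in Python, which is why each condition in a combined filter needs its own parentheses. The rule can be seen with plain integers, as in this small sketch. With Spark columns, an unparenthesized condition typically fails or builds a different expression than intended.

```python
# Without parentheses Python evaluates `1 & 2` first (bitwise AND, result 0)
# and then reads the rest as the chained comparison 2 > 0 > 1.
without_parens = 2 > 1 & 2 > 1

# With parentheses each comparison is evaluated first, then combined.
with_parens = (2 > 1) & (2 > 1)

print(without_parens)
print(with_parens)
```

The same grouping applies to `|`, and `~` should likewise be applied to a parenthesized condition.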

**Filters with one condition**

In [82]:
(
    df.filter(col('JobTitle') == 'FIREFIGHTER')
    .show(5)
)

+---+----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
| Id|    EmployeeName|   JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+---+----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
| 44|MICHAEL THOMPSON|FIREFIGHTER|123013.02|  111729.65|15575.26|    null|250317.93|       250317.93|2011| null|San Francisco|  null|
|103| LAUIFI SEUMAALA|FIREFIGHTER|105934.69|   98534.35|18890.96|    null| 223360.0|        223360.0|2011| null|San Francisco|  null|
|105|   PATRIC STEELE|FIREFIGHTER|105934.64|   97395.59|18760.77|    null| 222091.0|        222091.0|2011| null|San Francisco|  null|
|106|   MICHAEL WALSH|FIREFIGHTER|110474.93|   83670.04|27043.61|    null|221188.58|       221188.58|2011| null|San Francisco|  null|
|110|  SCOTT SCHOLZEN|FIREFIGHTER|105934.67|   96154.33|18655.

In [83]:
(
    df.filter(col('BasePay') >  310000)
    .show(5)
)

+-----+--------------------+--------------------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|   Id|        EmployeeName|            JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+-----+--------------------+--------------------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|72926|      Gregory P Suhr|     Chief of Police|319275.01|        0.0|20007.06|86533.21|339282.07|       425815.28|2013| null|San Francisco|  null|
|72927|Joanne M Hayes-White|Chief, Fire Depar...|313686.01|        0.0| 23236.0|85431.39|336922.01|        422353.4|2013| null|San Francisco|  null|
|72930|       Robert L Shaw|Dep Dir for Inves...|315572.01|        0.0|     0.0|82849.66|315572.01|       398421.67|2013| null|San Francisco|  null|
|72932|   Harlan L Kelly-Jr|Executive Contrac...|313312.52|        0.0|     0.0|82319.51|313312.52|       

**Filters with two or more conditions**

In [84]:
(
    df.filter((col('JobTitle') == 'FIREFIGHTER') & (col('BasePay') > 125000))
    .show(5)
)

+----+-----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|  Id|     EmployeeName|   JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+----+-----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|1537|  THERESA FOGARTY|FIREFIGHTER|126873.14|     511.56|33199.08|    null|160583.78|       160583.78|2011| null|San Francisco|  null|
|2911|   EUGENE EDEN-JR|FIREFIGHTER|125386.06|        0.0|16638.69|    null|142024.75|       142024.75|2011| null|San Francisco|  null|
|3512|RAYMOND POYDESSUS|FIREFIGHTER|125386.11|        0.0| 10283.4|    null|135669.51|       135669.51|2011| null|San Francisco|  null|
|3516|     PAUL ORLANDO|FIREFIGHTER|126873.18|        0.0| 8768.22|    null| 135641.4|        135641.4|2011| null|San Francisco|  null|
|3590|    KETTY FEDIGAN|FIREFIGHTER|126873.07|  

In [85]:
# Writing it in a less verbose way: chained filters are combined as a logical AND
(
    df
    .filter(col('JobTitle') == 'FIREFIGHTER')
    .filter(col('BasePay') > 125000)
    .show(5)
)

+----+-----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|  Id|     EmployeeName|   JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+----+-----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|1537|  THERESA FOGARTY|FIREFIGHTER|126873.14|     511.56|33199.08|    null|160583.78|       160583.78|2011| null|San Francisco|  null|
|2911|   EUGENE EDEN-JR|FIREFIGHTER|125386.06|        0.0|16638.69|    null|142024.75|       142024.75|2011| null|San Francisco|  null|
|3512|RAYMOND POYDESSUS|FIREFIGHTER|125386.11|        0.0| 10283.4|    null|135669.51|       135669.51|2011| null|San Francisco|  null|
|3516|     PAUL ORLANDO|FIREFIGHTER|126873.18|        0.0| 8768.22|    null| 135641.4|        135641.4|2011| null|San Francisco|  null|
|3590|    KETTY FEDIGAN|FIREFIGHTER|126873.07|  

**Filters using expressions**

In [87]:
(
    df.filter('JobTitle == "FIREFIGHTER" and BasePay >  125000')
    .show(5)
)

+----+-----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|  Id|     EmployeeName|   JobTitle|  BasePay|OvertimePay|OtherPay|Benefits| TotalPay|TotalPayBenefits|Year|Notes|       Agency|Status|
+----+-----------------+-----------+---------+-----------+--------+--------+---------+----------------+----+-----+-------------+------+
|1537|  THERESA FOGARTY|FIREFIGHTER|126873.14|     511.56|33199.08|    null|160583.78|       160583.78|2011| null|San Francisco|  null|
|2911|   EUGENE EDEN-JR|FIREFIGHTER|125386.06|        0.0|16638.69|    null|142024.75|       142024.75|2011| null|San Francisco|  null|
|3512|RAYMOND POYDESSUS|FIREFIGHTER|125386.11|        0.0| 10283.4|    null|135669.51|       135669.51|2011| null|San Francisco|  null|
|3516|     PAUL ORLANDO|FIREFIGHTER|126873.18|        0.0| 8768.22|    null| 135641.4|        135641.4|2011| null|San Francisco|  null|
|3590|    KETTY FEDIGAN|FIREFIGHTER|126873.07|  

**Most used methods:**

The col() function returns a Column object, which gives access to several methods that make it easier to filter data in the DataFrame. Some of them are:



* `isin()`: checks whether the column value is one of the listed values
* `contains()`: checks whether the column contains the given substring (does not accept regex)
* `like()`: similar to SQL LIKE; checks whether a text column matches a pattern written with the `%` and `_` wildcards (does not accept regex)
* `rlike()`: similar to SQL RLIKE (accepts regex)
* `startswith()`: checks whether the value starts with the given literal string (does not accept regex)
* `endswith()`: checks whether the value ends with the given literal string (does not accept regex)
* `between()`: checks whether the column values fall within the given range (both bounds inclusive)
* `isNull()`: checks whether the value is null
* `isNotNull()`: checks whether the value is not null
* `alias()/name()`: used to rename columns on the fly within an operation
* `astype()/cast()`: used to change the column type
* `substr()`: extracts part of a string given a start position (1-based) and a length