### SQL Expressions
##### We can also use SQL expressions to manipulate data.
##### The **expr** function, and **selectExpr** (a variant of the select method), evaluate SQL expressions.

In [2]:
import findspark
findspark.init()

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, udf
from pyspark.sql.types import IntegerType, StringType

#### Create the SparkSession

In [4]:
spark = SparkSession.builder.getOrCreate()

#### Create the DataFrames

In [5]:
data = [(1, "AAA", "dept1", 1000),
        (2, "BBB", "dept1", 1100),
        (3, "CCC", "dept1", 3000),
        (4, "DDD", "dept1", 1500),
        (5, "EEE", "dept2", 8000),
        (6, "FFF", "dept2", 7200),
        (7, "GGG", "dept3", 7100),
        (8, "HHH", "dept3", 3700),
        (9, "III", "dept3", 4500),
        (10, "JJJ", "dept5", 3400),
        (11, "KKK", "dept5", 3100),
        (12, "KFK", "dept5", 3100),
        (13, "KKF", "dept5", 3100),
        (14, "KBV", "dept6", 4100),
        (15, None, None, 1000),
        (16, None, None, 1000),
        (17, None, None, 1000),
        (18, "TKA", None, 1000),
        (19, None, "dept7", 1000),
        (20, None, "dept8", 1000)]

dept = [("dept1", "Departament - 1"),
        ("dept2", "Departament - 2"),
        ("dept3", "Departament - 3"),
        ("dept4", "Departament - 4")]

columns =  ["id", "name", "dept", "salary"]

In [6]:
df = spark.createDataFrame(data, columns)
df2 = spark.createDataFrame(dept, columns[:2])

In [7]:
from pyspark.sql.functions import expr

# Categorize the salary as low, mid, or high using the ranges below
# 1 - 2000    : low salary (low_salary)
# 2001 - 5000 : mid salary (mid_salary)
# > 5000      : high salary (high_salary)
# 0 or less   : invalid (invalid_salary)

cond = """case
            when salary > 5000 then 'high_salary'
            when salary > 2000 then 'mid_salary'
            when salary > 0 then 'low_salary'
            else 'invalid_salary'
          end as salary_level"""

newdf = df.withColumn("salary_level", expr(cond))
newdf.show()

+---+----+-----+------+------------+
| id|name| dept|salary|salary_level|
+---+----+-----+------+------------+
|  1| AAA|dept1|  1000|  low_salary|
|  2| BBB|dept1|  1100|  low_salary|
|  3| CCC|dept1|  3000|  mid_salary|
|  4| DDD|dept1|  1500|  low_salary|
|  5| EEE|dept2|  8000| high_salary|
|  6| FFF|dept2|  7200| high_salary|
|  7| GGG|dept3|  7100| high_salary|
|  8| HHH|dept3|  3700|  mid_salary|
|  9| III|dept3|  4500|  mid_salary|
| 10| JJJ|dept5|  3400|  mid_salary|
| 11| KKK|dept5|  3100|  mid_salary|
| 12| KFK|dept5|  3100|  mid_salary|
| 13| KKF|dept5|  3100|  mid_salary|
| 14| KBV|dept6|  4100|  mid_salary|
| 15|null| null|  1000|  low_salary|
| 16|null| null|  1000|  low_salary|
| 17|null| null|  1000|  low_salary|
| 18| TKA| null|  1000|  low_salary|
| 19|null|dept7|  1000|  low_salary|
| 20|null|dept8|  1000|  low_salary|
+---+----+-----+------+------------+



#### Using the **selectExpr** method

In [8]:
newdf2 = df.selectExpr("*",cond)
newdf2.show()

+---+----+-----+------+------------+
| id|name| dept|salary|salary_level|
+---+----+-----+------+------------+
|  1| AAA|dept1|  1000|  low_salary|
|  2| BBB|dept1|  1100|  low_salary|
|  3| CCC|dept1|  3000|  mid_salary|
|  4| DDD|dept1|  1500|  low_salary|
|  5| EEE|dept2|  8000| high_salary|
|  6| FFF|dept2|  7200| high_salary|
|  7| GGG|dept3|  7100| high_salary|
|  8| HHH|dept3|  3700|  mid_salary|
|  9| III|dept3|  4500|  mid_salary|
| 10| JJJ|dept5|  3400|  mid_salary|
| 11| KKK|dept5|  3100|  mid_salary|
| 12| KFK|dept5|  3100|  mid_salary|
| 13| KKF|dept5|  3100|  mid_salary|
| 14| KBV|dept6|  4100|  mid_salary|
| 15|null| null|  1000|  low_salary|
| 16|null| null|  1000|  low_salary|
| 17|null| null|  1000|  low_salary|
| 18| TKA| null|  1000|  low_salary|
| 19|null|dept7|  1000|  low_salary|
| 20|null|dept8|  1000|  low_salary|
+---+----+-----+------+------------+



In [21]:
# show() returns None, so its result is not assigned to a variable
df.selectExpr("*", cond).groupBy('salary_level').count().show()

+------------+-----+
|salary_level|count|
+------------+-----+
| high_salary|    3|
|  low_salary|    9|
|  mid_salary|    8|
+------------+-----+
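
The grouped counts can be cross-checked without Spark. The sketch below copies the salaries from the `data` list and applies the same thresholds in plain Python; `level_of` is a hypothetical helper introduced only for this check.

```python
from collections import Counter

# Salaries copied from the `data` list, in id order.
salaries = [1000, 1100, 3000, 1500, 8000, 7200, 7100, 3700, 4500, 3400,
            3100, 3100, 3100, 4100, 1000, 1000, 1000, 1000, 1000, 1000]

def level_of(salary):
    """Hypothetical helper mirroring the thresholds of the CASE expression."""
    if salary > 5000:
        return "high_salary"
    if salary > 2000:
        return "mid_salary"
    if salary > 0:
        return "low_salary"
    return "invalid_salary"

# Count how many salaries fall into each level.
counts = Counter(level_of(s) for s in salaries)
print(dict(counts))
```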



### User-defined functions (UDFs)
##### We often need a function tailored to a very specific **requirement**.
##### This is where UDFs come in.
##### We can write our own function in a language such as Python, register it as a UDF, and then use it in DataFrame operations.

In [28]:
def datSalary_Level(sal):
    # Map a salary to its level, using the same thresholds as the SQL CASE expression
    if sal > 5000:
        level = 'high_salary'
    elif sal > 2000:
        level = 'mid_salary'
    elif sal > 0:
        level = 'low_salary'
    else:
        level = 'invalid_salary'
    return level

def datSalary_Brute(sal):
    # Multiply the salary by a fixed factor of 31
    return sal * 31
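
Because a UDF wraps an ordinary Python function, its logic can be checked without Spark before registering it. The sketch below uses a hypothetical null-tolerant variant, `salary_level_safe`, which is an assumption and not part of the original notebook. Spark passes SQL NULL to a Python UDF as `None`, and comparing `None > 5000` raises a `TypeError`, so the guard matters if the salary column can contain nulls.

```python
def salary_level_safe(sal):
    """Hypothetical null-tolerant variant of datSalary_Level."""
    if sal is None:
        # Keep nulls as nulls instead of failing on the comparison below.
        return None
    if sal > 5000:
        return 'high_salary'
    if sal > 2000:
        return 'mid_salary'
    if sal > 0:
        return 'low_salary'
    return 'invalid_salary'

# Boundary values: each threshold itself belongs to the lower level.
checks = {8000: 'high_salary', 5000: 'mid_salary', 2000: 'low_salary', 0: 'invalid_salary'}
for salary, expected in checks.items():
    assert salary_level_safe(salary) == expected
assert salary_level_safe(None) is None
```

Registering it would follow the same pattern as the other functions, for example `udf(salary_level_safe, StringType())`.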

- Then register the functions datSalary_Level and datSalary_Brute as UDFs

In [31]:
sal_level = udf(datSalary_Level, StringType())
sal_brute = udf(datSalary_Brute, IntegerType())

#### Then apply the UDFs to compute the salary level for each salary.

In [32]:
newdf_udf = df.withColumn('salary_level', sal_level('salary'))
# Chain from newdf_udf (not df) so the salary_level column is kept
newdf_udf = newdf_udf.withColumn('salary_brute', sal_brute('salary'))
newdf_udf.show()

+---+----+-----+------+------------+------------+
| id|name| dept|salary|salary_level|salary_brute|
+---+----+-----+------+------------+------------+
|  1| AAA|dept1|  1000|  low_salary|       31000|
|  2| BBB|dept1|  1100|  low_salary|       34100|
|  3| CCC|dept1|  3000|  mid_salary|       93000|
|  4| DDD|dept1|  1500|  low_salary|       46500|
|  5| EEE|dept2|  8000| high_salary|      248000|
|  6| FFF|dept2|  7200| high_salary|      223200|
|  7| GGG|dept3|  7100| high_salary|      220100|
|  8| HHH|dept3|  3700|  mid_salary|      114700|
|  9| III|dept3|  4500|  mid_salary|      139500|
| 10| JJJ|dept5|  3400|  mid_salary|      105400|
| 11| KKK|dept5|  3100|  mid_salary|       96100|
| 12| KFK|dept5|  3100|  mid_salary|       96100|
| 13| KKF|dept5|  3100|  mid_salary|       96100|
| 14| KBV|dept6|  4100|  mid_salary|      127100|
| 15|null| null|  1000|  low_salary|       31000|
| 16|null| null|  1000|  low_salary|       31000|
| 17|null| null|  1000|  low_salary|       31000|
| 18| TKA| null|  1000|  low_salary|       31000|
| 19|null|dept7|  1000|  low_salary|       31000|
| 20|null|dept8|  1000|  low_salary|       31000|
+---+----+-----+------+------------+------------+

