#### Exercise 2.2
Given the following data frame, programmatically count the number of columns that aren’t strings (answer = only one column isn’t a string).
createDataFrame() allows you to create a data frame from a variety of sources, such as a pandas data frame or (in this case) a list of lists.

In [0]:
exo2_2_df = spark.createDataFrame(
    [
        ["test_1", "more test 1", 10], 
        ["test_2", "more test 2", 11], 
        ["test_3", "more test 3", 12]
    ], 
    ["one", "two", "three"]
)

In [0]:
exo2_2_df.printSchema()
exo2_2_df.show(5)


In [0]:
print(len([x for x, y in exo2_2_df.dtypes if y != "string"]))

Answer: The column "three" is not a string; it is of type long.
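
For intuition, the `dtypes` attribute is a list of (column name, type string) pairs, so the count reduces to an ordinary list comprehension. The sketch below hard-codes the list it should return for this data frame (an assumption made so the snippet runs without a Spark session). Note that `dtypes` reports a long column as "bigint", while printSchema() prints it as long.

```python
# Assumed value of exo2_2_df.dtypes, hard-coded so the sketch runs without Spark.
dtypes = [("one", "string"), ("two", "string"), ("three", "bigint")]

# Keep the names of the columns whose type string is not "string".
non_string_columns = [name for name, dtype in dtypes if dtype != "string"]

print(non_string_columns)
print(len(non_string_columns))
```

Collecting the names rather than only the count makes it easy to see which column is responsible for the answer.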

#### Exercise 2.3
Rewrite the following code snippet, removing the withColumnRenamed method. Which
version is clearer and easier to read?

In [0]:
# The `length` function returns the number of characters in a string column.

# Parameters
path_book = "/Volumes/workspace/dataanalysispysparkbook/bronze_files/1342-0.txt"

from pyspark.sql.functions import col, length

exo2_3_df = (spark.read.text(path_book)
                        .select(length(col("value")))
                        .withColumnRenamed("length(value)", "number_of_char")
            )
exo2_3_df.printSchema()
exo2_3_df.show(5, truncate=False)            

In [0]:
rewrite_df = (spark.read.text(path_book)
                        .select(length(col("value")).alias("number_of_char"))
             )
rewrite_df.printSchema()
rewrite_df.show(5, truncate=False)

#### Exercise 2.4
Assume a data frame exo2_4_df. The following code block gives an error. What is the
problem, and how can you solve it?

In [0]:
exo2_4_df = spark.createDataFrame(
    [
      ["key", 10_000, 20_000]
    ], 
    ["key", "value1", "value2"]
)

from pyspark.sql.functions import col, greatest
from pyspark.sql.utils import AnalysisException

exo2_4_df.printSchema()
# root
# |-- key: string (containsNull = true)
# |-- value1: long (containsNull = true)
# |-- value2: long (containsNull = true)
# `greatest` returns the greatest value among the given columns,
# skipping null values.

# The following statement will return an error
try:
    exo2_4_mod = exo2_4_df.select(
                        greatest(col("value1"), col("value2")).alias("maximum_value")
    ).select("key", "max_value")

    exo2_4_mod.show(5)
except AnalysisException as err:
    print(err)

In [0]:
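# Problem: the first select() keeps only the aliased column, so "key" is gone,
# and the alias is "maximum_value" while the second select() asks for "max_value".
# Fix: select "key" and the aliased maximum in a single select().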
exo2_4_mod = exo2_4_df.select(
    col("key"),
    greatest(col("value1"), col("value2")).alias("max_value")
)

exo2_4_mod.show(5)

#### Exercise 2.5
Let’s take our words_nonull data frame, available in the next listing. You can use the code from the repository (code/Ch02/end_of_chapter.py) in your REPL to get the data frame loaded.

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split, explode, lower, regexp_extract

book = spark.read.text(path_book)
lines = book.select(split(book.value, " ").alias("line"))
words = lines.select(explode(col("line")).alias("word"))
words_lower = words.select(lower(col("word")).alias("word_lower"))
words_clean = words_lower.select(
  regexp_extract(col("word_lower"), "[a-z]*", 0).alias("word")
)
words_nonull = words_clean.where(col("word") != "")
words_nonull.show(5)

a) Remove all of the occurrences of the word is.

In [0]:
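words_clean.filter(col("word") == "is").show(5)
# replace("is", "") would only blank out the value and keep the rows,
# so filter the rows out instead.
words_clean_is = words_clean.filter(col("word") != "is")
words_clean_is.filter(col("word") == "is").show(5)  # Should now be empty
words_clean_is.show(5)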


b) (Challenge) Using the length function, keep only the words with more than three characters.

In [0]:
from pyspark.sql.functions import col, length

# "More than three characters" means a strict comparison: length > 3.
words_more_than_three = words_clean.filter(length(col("word")) > 3)
words_more_than_three.show(5)

#### Exercise 2.6
The where clause takes a Boolean expression over one or many columns to filter the data frame. Beyond the usual Boolean operators (>, <, ==, <=, >=, !=), PySpark provides other functions that return Boolean columns in the pyspark.sql.functions module.

A good example is the isin() method (applied on a Column object, like col(…).isin(…)), which takes a list of values as a parameter, and will return only the records where the value in the column equals a member of the list.
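
As a rough plain-Python analogy (not PySpark code), the row-wise logic of `isin()` and of its negation with `~` can be sketched as below. The word list and stop-word list are hypothetical sample values. One difference worth modeling: for a null value, the Spark condition evaluates to null rather than true, so where() drops that row in both the plain and the negated case.

```python
# Hypothetical stop-word list and sample values standing in for the "word" column.
stopwords = ["is", "not", "the", "if"]
words = ["it", "is", "not", "", None, "the", "truth", "if"]

# Row-wise meaning of where(col("word").isin(stopwords)):
# keep a row only when its value is a member of the list.
kept_by_isin = [w for w in words if w is not None and w in stopwords]

# Row-wise meaning of where(~col("word").isin(stopwords)):
# keep a row only when its value is not null and not in the list.
kept_by_not_isin = [w for w in words if w is not None and w not in stopwords]

print(kept_by_isin)
print(kept_by_not_isin)
```

The empty string survives the negated filter because it is a regular value that is simply not in the list; removing it takes a separate condition.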

Let’s say you want to remove the words is, not, the and if from your list of words, using a single where() method on the words_nonull data frame. Write the code to do so.

In [0]:
stopwords = ["is", "not", "the", "if"]

words_no_is_not_the_if = (
    words_clean
    .filter(col("word").isNotNull())               # Exclude nulls
    .filter(col("word") != "")                     # Exclude empty strings
    .filter(~col("word").isin(stopwords))          # Exclude stop words
)
words_no_is_not_the_if.show(5)

#### Exercise 2.7
One of your friends comes to you with the following code. They have no idea why it
doesn’t work. Can you diagnose the problem in the try block, explain why it is an
error, and provide a fix?

In [0]:
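from pyspark.sql.functions import col, explode, split
from pyspark.sql.utils import AnalysisException

try:
    book = spark.read.text(path_book)
    # Problem: printSchema() prints the schema and returns None, so
    # `book = book.printSchema()` replaces the data frame with None.
    # Fix: call book.printSchema() without assigning its result.
    # book = book.printSchema()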
    lines = book.select(split(book.value, " ").alias("line"))
    words = lines.select(explode(col("line")).alias("word"))
    words.show(5)
except AnalysisException as err:
    print(err)
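
To see why the commented-out line `book = book.printSchema()` breaks the snippet, here is a minimal plain-Python sketch. `FakeFrame` is a hypothetical stand-in written for this illustration, not a PySpark class; it only mimics the fact that printSchema() prints its output and returns None.

```python
class FakeFrame:
    """Hypothetical stand-in for a data frame; not part of PySpark."""

    def printSchema(self):
        # Like DataFrame.printSchema(), print a schema and return None.
        print("root\n |-- value: string (nullable = true)")

    def select(self, *cols):
        return self


book = FakeFrame()
book = book.printSchema()  # book is now None, not a data frame

try:
    book.select("value")
    error_name = None
except AttributeError as err:
    # None has no select() method, so Python raises AttributeError.
    error_name = type(err).__name__
    print(err)
```

Because the failure is an AttributeError raised by Python itself, an `except AnalysisException` clause would not catch it. Calling printSchema() on its own line, without reassigning `book`, avoids the problem.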