1. Spark Dataframe Basics

    1. Use the starter code above to create a pandas dataframe.
    1. Convert the pandas dataframe to a spark dataframe. From this point
       forward, do all of your work with the spark dataframe, not the pandas
       dataframe.
    1. Show the first 3 rows of the dataframe.
    1. Show the first 7 rows of the dataframe.
    1. View a summary of the data using `.describe`.
    1. Use `.select` to create a new dataframe with just the `n` and `abool`
       columns. View the first 5 rows of this dataframe.
    1. Use `.select` to create a new dataframe with just the `group` and `abool`
       columns. View the first 5 rows of this dataframe.
    1. Use `.select` to create a new dataframe with the `group` column and the
       `abool` column renamed to `a_boolean_value`. Show the first 3 rows of
       this dataframe.
    1. Use `.select` to create a new dataframe with the `group` column and the
       `n` column renamed to `a_numeric_value`. Show the first 6 rows of this
       dataframe.

1. Column Manipulation

    1. Use the starter code above to re-create a spark dataframe. Store the
       spark dataframe in a variable named `df`.

    1. Use `.select` to add 4 to the `n` column. Show the results.

    1. Subtract 5 from the `n` column and view the results.

    1. Multiply the `n` column by 2. View the results along with the original
       numbers.

    1. Add a new column named `n2` that is the `n` value multiplied by -1. Show
       the first 4 rows of your dataframe. You should see the original `n` value
       as well as `n2`.

    1. Add a new column named `n3` that is the `n` value squared. Show the first 5
       rows of your dataframe. You should see `n`, `n2`, and `n3`.

    1. What happens when you run the code below?

        ```python
        df.group + df.abool
        ```

    1. What happens when you run the code below? What is the difference between
       this and the previous code sample?

        ```python
        df.select(df.group + df.abool)
        ```

    1. Try adding various other columns together. What are the results of
       combining the different data types?

1. Spark SQL

    1. Use the starter code above to re-create a spark dataframe.
    1. Turn your dataframe into a table that can be queried with spark SQL. Name
       the table `my_df`. Answer the rest of the questions in this section with
       a spark sql query (`spark.sql`) against `my_df`. After each step, view
       the first 7 records from the dataframe.
    1. Write a query that shows all of the columns from your dataframe.
    1. Write a query that shows just the `n` and `abool` columns from the
       dataframe.
    1. Write a query that shows just the `n` and `group` columns. Rename the
       `group` column to `g`.
    1. Write a query that selects `n`, and creates two new columns: `n2`, the
       original `n` values halved, and `n3`: the original n values minus 1.
    1. What happens if you make a SQL syntax error in your query?

1. Type casting

    1. Use the starter code above to re-create a spark dataframe.

    1. Use `.printSchema` to view the datatypes in your dataframe.

    1. Use `.dtypes` to view the datatypes in your dataframe.

    1. What is the difference between the two code samples below?

        ```python
        df.abool.cast('int')
        ```

        ```python
        df.select(df.abool.cast('int')).show()
        ```

    1. Use `.select` and `.cast` to convert the `abool` column to an integer
       type. View the results.
    1. Convert the `group` column to an integer data type and view the results.
       What happens?
    1. Convert the `n` column to an integer data type and view the results. What
       happens?
    1. Convert the `abool` column to a string data type and view the results.
       What happens?

1. Built-in Functions

    1. Use the starter code above to re-create a spark dataframe.
    1. Import the necessary functions from `pyspark.sql.functions`
    1. Find the highest `n` value.
    1. Find the lowest `n` value.
    1. Find the average `n` value.
    1. Use `concat` to change the `group` column to say, e.g. "Group: x" or
       "Group: y"
    1. Use `concat` to combine the `n` and `group` columns to produce results
       that look like this: "x: -1.432" or "z: 2.352"

1. Filter / Where

    1. Use the starter code above to re-create a spark dataframe.
    1. Use `.filter` or `.where` to select just the rows where the group is `y`
       and view the results.
    1. Select just the rows where the `abool` column is false and view the
       results.
    1. Find the rows where the `group` column is *not* `y`.
    1. Find the rows where `n` is positive.
    1. Find the rows where `abool` is true and the `group` column is `z`.
    1. Find the rows where `abool` is true or the `group` column is `z`.
    1. Find the rows where `abool` is false and `n` is less than 1.
    1. Find the rows where `abool` is false or `n` is less than 1.

1. When / Otherwise

    1. Use the starter code above to re-create a spark dataframe.
    1. Use `when` and `.otherwise` to create a column that contains the text "It
       is true" when `abool` is true and "It is false" when `abool` is false.
    1. Create a column that contains 0 if `n` is less than 0, otherwise, the
       original `n` value.

1. Sorting

    1. Use the starter code above to re-create a spark dataframe.
    1. Sort by the `n` value.
    1. Sort by the `group` value, both ascending and descending.
    1. Sort by the group value first, then, within each group, sort by `n`
       value.
    1. Sort by `abool`, `group`, and `n`. Does it matter in what order you
       specify the columns when sorting?

1. Aggregating

    1. What is the average `n` value for each group in the `group` column?
    1. What is the maximum `n` value for each group in the `group` column?
    1. What is the minimum `n` value by `abool`?
    1. What is the average `n` value for each unique combination of the `group`
       and `abool` column?

1. Spark Dataframe Basics

    1. Use the starter code above to create a pandas dataframe.
    1. Convert the pandas dataframe to a spark dataframe. From this point
       forward, do all of your work with the spark dataframe, not the pandas
       dataframe.
    1. Show the first 3 rows of the dataframe.
    1. Show the first 7 rows of the dataframe.
    1. View a summary of the data using `.describe`.
    1. Use `.select` to create a new dataframe with just the `n` and `abool`
       columns. View the first 5 rows of this dataframe.
    1. Use `.select` to create a new dataframe with just the `group` and `abool`
       columns. View the first 5 rows of this dataframe.
    1. Use `.select` to create a new dataframe with the `group` column and the
       `abool` column renamed to `a_boolean_value`. Show the first 3 rows of
       this dataframe.
    1. Use `.select` to create a new dataframe with the `group` column and the
       `n` column renamed to `a_numeric_value`. Show the first 6 rows of this
       dataframe.

In [1]:
import pandas as pd
import numpy as np
import pyspark

In [2]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

In [3]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [4]:
df.show(3)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|-0.5935304728163509|    y|false|
|-0.6537654714845256|    z| true|
| 1.1565401871181058|    y|false|
+-------------------+-----+-----+
only showing top 3 rows



In [5]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.5935304728163509|    y|false|
| -0.6537654714845256|    z| true|
|  1.1565401871181058|    y|false|
|-0.23502098371277275|    y| true|
| -0.7460054809035613|    z| true|
| -0.5952652598384565|    z|false|
|  0.9394948411503021|    y| true|
+--------------------+-----+-----+
only showing top 7 rows



In [6]:
df.describe().show()

+-------+-------------------+-----+
|summary|                  n|group|
+-------+-------------------+-----+
|  count|                 20|   20|
|   mean|0.14878403127458903| null|
| stddev|  0.976846008620517| null|
|    min|   -2.1324060166211|    x|
|    max| 1.6391310403186294|    z|
+-------+-------------------+-----+



In [7]:
df_n_abool = df.select('n','abool')
df_n_abool.show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
| -0.5935304728163509|false|
| -0.6537654714845256| true|
|  1.1565401871181058|false|
|-0.23502098371277275| true|
| -0.7460054809035613| true|
+--------------------+-----+
only showing top 5 rows



In [8]:
df_group_abool = df.select('group','abool')
df_group_abool.show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    y|false|
|    z| true|
|    y|false|
|    y| true|
|    z| true|
+-----+-----+
only showing top 5 rows



In [9]:
df_group_abool = df.select('group', df.abool.alias('a_boolean_value'))
df_group_abool.show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    y|          false|
|    z|           true|
|    y|          false|
+-----+---------------+
only showing top 3 rows



In [10]:
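df_group_n = df.select('group', df.n.alias('a_numeric_value'))
df_group_n.show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    y| -0.5935304728163509|
|    z| -0.6537654714845256|
|    y|  1.1565401871181058|
|    y|-0.23502098371277275|
|    z| -0.7460054809035613|
|    z| -0.5952652598384565|
+-----+--------------------+
only showing top 6 rows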



1. Column Manipulation

    1. Use the starter code above to re-create a spark dataframe. Store the
       spark dataframe in a variable named `df`.

    1. Use `.select` to add 4 to the `n` column. Show the results.

    1. Subtract 5 from the `n` column and view the results.

    1. Multiply the `n` column by 2. View the results along with the original
       numbers.

    1. Add a new column named `n2` that is the `n` value multiplied by -1. Show
       the first 4 rows of your dataframe. You should see the original `n` value
       as well as `n2`.

    1. Add a new column named `n3` that is the `n` value squared. Show the first 5
       rows of your dataframe. You should see `n`, `n2`, and `n3`.

    1. What happens when you run the code below?

        ```python
        df.group + df.abool
        ```

    1. What happens when you run the code below? What is the difference between
       this and the previous code sample?

        ```python
        df.select(df.group + df.abool)
        ```

    1. Try adding various other columns together. What are the results of
       combining the different data types?

In [11]:
from pyspark.sql.functions import col, expr

In [12]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [13]:
df.show(5)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|  0.727988440509268|    z| true|
|0.28328764402352613|    y|false|
| 0.4854567124399692|    z| true|
|0.04009618088775551|    y| true|
| 0.1317996612178238|    y| true|
+-------------------+-----+-----+
only showing top 5 rows



In [14]:
df_add_4 = df.select(df.n + 4, df.group, df.abool)
df_add_4.show()  # show the new dataframe; df itself is unchanged



In [15]:
df_sub_5 = df.select(df.n - 5, df.group, df.abool)
df_sub_5.show(5)  # show the new dataframe; df itself is unchanged



In [16]:
df_X2 = df.select(df.n * 2, df.n)
df_X2.show(5)

+-------------------+-------------------+
|            (n * 2)|                  n|
+-------------------+-------------------+
|  1.455976881018536|  0.727988440509268|
| 0.5665752880470523|0.28328764402352613|
| 0.9709134248799384| 0.4854567124399692|
|0.08019236177551102|0.04009618088775551|
| 0.2635993224356476| 0.1317996612178238|
+-------------------+-------------------+
only showing top 5 rows



1. Spark SQL

    1. Use the starter code above to re-create a spark dataframe.
    1. Turn your dataframe into a table that can be queried with spark SQL. Name
       the table `my_df`. Answer the rest of the questions in this section with
       a spark sql query (`spark.sql`) against `my_df`. After each step, view
       the first 7 records from the dataframe.
    1. Write a query that shows all of the columns from your dataframe.
    1. Write a query that shows just the `n` and `abool` columns from the
       dataframe.
    1. Write a query that shows just the `n` and `group` columns. Rename the
       `group` column to `g`.
    1. Write a query that selects `n`, and creates two new columns: `n2`, the
       original `n` values halved, and `n3`: the original n values minus 1.
    1. What happens if you make a SQL syntax error in your query?

In [17]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [18]:
df.createOrReplaceTempView("my_df")

In [19]:
spark.sql('''SELECT * FROM my_df''').show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  0.7458489961049468|    z|false|
| 0.06497295596977953|    y|false|
| 0.25012865092311865|    y|false|
|0.008169109603518434|    y| true|
| -0.7770347810255167|    z|false|
|-0.06365780544492623|    x|false|
|    0.07003372043635|    y| true|
+--------------------+-----+-----+
only showing top 7 rows



In [20]:
spark.sql('''SELECT n, abool
             FROM my_df''').show(7)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  0.7458489961049468|false|
| 0.06497295596977953|false|
| 0.25012865092311865|false|
|0.008169109603518434| true|
| -0.7770347810255167|false|
|-0.06365780544492623|false|
|    0.07003372043635| true|
+--------------------+-----+
only showing top 7 rows



In [21]:
spark.sql('''SELECT n, group AS g
             FROM my_df''').show(7)

+--------------------+---+
|                   n|  g|
+--------------------+---+
|  0.7458489961049468|  z|
| 0.06497295596977953|  y|
| 0.25012865092311865|  y|
|0.008169109603518434|  y|
| -0.7770347810255167|  z|
|-0.06365780544492623|  x|
|    0.07003372043635|  y|
+--------------------+---+
only showing top 7 rows



Write a query that selects `n` and creates two new columns: `n2`, the original `n` values halved, and `n3`, the original `n` values minus 1.

In [22]:
spark.sql('''
          SELECT n, (n/2) AS n2, (n-1) AS n3
          FROM my_df
          ''').show(7)

+--------------------+--------------------+-------------------+
|                   n|                  n2|                 n3|
+--------------------+--------------------+-------------------+
|  0.7458489961049468|  0.3729244980524734|-0.2541510038950532|
| 0.06497295596977953| 0.03248647798488977|-0.9350270440302204|
| 0.25012865092311865| 0.12506432546155932|-0.7498713490768814|
|0.008169109603518434|0.004084554801759217|-0.9918308903964815|
| -0.7770347810255167|-0.38851739051275835|-1.7770347810255167|
|-0.06365780544492623|-0.03182890272246...|-1.0636578054449262|
|    0.07003372043635|   0.035016860218175|  -0.92996627956365|
+--------------------+--------------------+-------------------+
only showing top 7 rows



1. Type casting

    1. Use the starter code above to re-create a spark dataframe.

    1. Use `.printSchema` to view the datatypes in your dataframe.

    1. Use `.dtypes` to view the datatypes in your dataframe.

    1. What is the difference between the two code samples below?

        ```python
        df.abool.cast('int')
        ```

        ```python
        df.select(df.abool.cast('int')).show()
        ```

    1. Use `.select` and `.cast` to convert the `abool` column to an integer
       type. View the results.
    1. Convert the `group` column to an integer data type and view the results.
       What happens?
    1. Convert the `n` column to an integer data type and view the results. What
       happens?
    1. Convert the `abool` column to a string data type and view the results.
       What happens?

In [23]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [24]:
df.printSchema()

root
 |-- n: double (nullable = true)
 |-- group: string (nullable = true)
 |-- abool: boolean (nullable = true)



In [25]:
df.dtypes

[('n', 'double'), ('group', 'string'), ('abool', 'boolean')]

In [27]:
df.abool.cast('int')

Column<b'CAST(abool AS INT)'>

In [28]:
df.select(df.abool.cast('int')).show(5) # shows abool as int

+-----+
|abool|
+-----+
|    1|
|    0|
|    1|
|    1|
|    0|
+-----+
only showing top 5 rows



In [29]:
df.select(df.abool.cast('int')).show(5)

+-----+
|abool|
+-----+
|    1|
|    0|
|    1|
|    1|
|    0|
+-----+
only showing top 5 rows



In [30]:
df.select(df.group.cast('int')).show(5)

+-----+
|group|
+-----+
| null|
| null|
| null|
| null|
| null|
+-----+
only showing top 5 rows



In [31]:
df.select(df.n.cast('int')).show(5)

+---+
|  n|
+---+
|  0|
|  0|
|  0|
|  0|
| -1|
+---+
only showing top 5 rows



1. Built-in Functions

    1. Use the starter code above to re-create a spark dataframe.
    1. Import the necessary functions from `pyspark.sql.functions`
    1. Find the highest `n` value.
    1. Find the lowest `n` value.
    1. Find the average `n` value.
    1. Use `concat` to change the `group` column to say, e.g. "Group: x" or
       "Group: y"
    1. Use `concat` to combine the `n` and `group` columns to produce results
       that look like this: "x: -1.432" or "z: 2.352"

In [32]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [33]:
from pyspark.sql.functions import avg, col, concat, lit, max, min  # note: max and min shadow the Python builtins

In [34]:
df.select(max(df.n),min(df.n),avg(df.n)).show()

+------------------+-------------------+------------------+
|            max(n)|             min(n)|            avg(n)|
+------------------+-------------------+------------------+
|2.8579310286325224|-1.7150504069178603|0.2956113849624353|
+------------------+-------------------+------------------+



In [35]:
df.select(concat(lit("Group: "), col('group'))).show(5)

+----------------------+
|concat(Group: , group)|
+----------------------+
|              Group: x|
|              Group: x|
|              Group: x|
|              Group: x|
|              Group: z|
+----------------------+
only showing top 5 rows



In [36]:
df.select(concat(col('group'),lit(':'),col('n'))).show(5)

+--------------------+
| concat(group, :, n)|
+--------------------+
|x:0.4920813422832545|
|x:0.6158480554333929|
|x:-1.715050406917...|
|x:-0.969704300304698|
|z:0.2884450182676466|
+--------------------+
only showing top 5 rows



1. Filter / Where

    1. Use the starter code above to re-create a spark dataframe.
    1. Use `.filter` or `.where` to select just the rows where the group is `y`
       and view the results.
    1. Select just the rows where the `abool` column is false and view the
       results.
    1. Find the rows where the `group` column is *not* `y`.
    1. Find the rows where `n` is positive.
    1. Find the rows where `abool` is true and the `group` column is `z`.
    1. Find the rows where `abool` is true or the `group` column is `z`.
    1. Find the rows where `abool` is false and `n` is less than 1.
    1. Find the rows where `abool` is false or `n` is less than 1.

In [37]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [38]:
df.filter((df.group=='y')).show(5)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
| 0.7258017690577546|    y| true|
|0.04869340042592003|    y| true|
| 0.4568236077825877|    y|false|
|-1.8988623219149174|    y| true|
|-0.6449311328844708|    y| true|
+-------------------+-----+-----+



In [39]:
df.filter((df.abool==False)).show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| 0.06619528210571758|    x|false|
|  0.4568236077825877|    y|false|
|-0.11062430840489158|    x|false|
|  1.4849541406216085|    z|false|
|  0.5249092744030445|    z|false|
+--------------------+-----+-----+



In [40]:
df.filter((df.group!='y')).show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|-0.19149878964595465|    z| true|
|  -1.756251409005848|    z| true|
|  0.7376332949773831|    z| true|
| -0.9995186526384939|    x| true|
| -0.5777529197731882|    z| true|
+--------------------+-----+-----+
only showing top 5 rows



In [41]:
df.filter(df.n > 0).show(5)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
| 0.7376332949773831|    z| true|
|0.06619528210571758|    x|false|
| 0.7258017690577546|    y| true|
|0.04869340042592003|    y| true|
|  1.395471272874804|    x| true|
+-------------------+-----+-----+
only showing top 5 rows



In [42]:
df.filter(df.abool==True).where(df.group=='z').show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|-0.19149878964595465|    z| true|
|  -1.756251409005848|    z| true|
|  0.7376332949773831|    z| true|
| -0.5777529197731882|    z| true|
|-0.21490021271307547|    z| true|
+--------------------+-----+-----+



In [43]:
df.filter((df.abool==True) | (df.group=='z')).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|-0.19149878964595465|    z| true|
|  -1.756251409005848|    z| true|
|  0.7376332949773831|    z| true|
| -0.9995186526384939|    x| true|
| -0.5777529197731882|    z| true|
|  0.7258017690577546|    y| true|
| -0.8083872118669346|    x| true|
| 0.04869340042592003|    y| true|
|   1.395471272874804|    x| true|
| -0.7176677330788243|    x| true|
|  0.7600323885350998|    x| true|
|  1.4849541406216085|    z|false|
| -1.8988623219149174|    y| true|
| -0.8626565079722793|    x| true|
|  0.5249092744030445|    z|false|
|-0.21490021271307547|    z| true|
| -0.6449311328844708|    y| true|
+--------------------+-----+-----+



In [44]:
df.filter((df.abool==False) & (df.n < 1)).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| 0.06619528210571758|    x|false|
|  0.4568236077825877|    y|false|
|-0.11062430840489158|    x|false|
|  0.5249092744030445|    z|false|
+--------------------+-----+-----+



In [45]:
df.filter((df.abool==False) | (df.n < 1)).show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|-0.19149878964595465|    z| true|
|  -1.756251409005848|    z| true|
|  0.7376332949773831|    z| true|
| -0.9995186526384939|    x| true|
| -0.5777529197731882|    z| true|
+--------------------+-----+-----+
only showing top 5 rows



1. When / Otherwise

    1. Use the starter code above to re-create a spark dataframe.
    1. Use `when` and `.otherwise` to create a column that contains the text "It
       is true" when `abool` is true and "It is false" when `abool` is false.
    1. Create a column that contains 0 if `n` is less than 0, otherwise, the
       original `n` value.

In [46]:
from pyspark.sql.functions import when

In [47]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [48]:
df.select(df.abool,when(df.abool==True, "It is True").otherwise("It is False").alias('Truth_String')).show(5)

+-----+------------+
|abool|Truth_String|
+-----+------------+
| true|  It is True|
| true|  It is True|
| true|  It is True|
| true|  It is True|
|false| It is False|
+-----+------------+
only showing top 5 rows



In [49]:
df.select(df.n,when(df.n < 0, 0).otherwise(df.n).alias('Positive_or_Zero')).show(10)

+--------------------+-------------------+
|                   n|   Positive_or_Zero|
+--------------------+-------------------+
| 0.14410354790123864|0.14410354790123864|
|  0.1425318830306353| 0.1425318830306353|
|-0.11753435775447621|                0.0|
|-0.05898879678452567|                0.0|
| 0.02907765554037852|0.02907765554037852|
|-0.37866952724621417|                0.0|
|  1.3118743067346454| 1.3118743067346454|
|   2.305869416373438|  2.305869416373438|
| 0.30323512962752536|0.30323512962752536|
|  -0.680256611380646|                0.0|
+--------------------+-------------------+
only showing top 10 rows



1. Sorting

    1. Use the starter code above to re-create a spark dataframe.
    1. Sort by the `n` value.
    1. Sort by the `group` value, both ascending and descending.
    1. Sort by the group value first, then, within each group, sort by `n`
       value.
    1. Sort by `abool`, `group`, and `n`. Does it matter in what order you
       specify the columns when sorting?


In [50]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [51]:
df.sort(col('n')).show(5)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|-1.6862345883268082|    z|false|
|-1.5097845963470908|    y| true|
|-0.9401936269644735|    y|false|
|-0.8305550322712313|    x|false|
|-0.5833067771337954|    x|false|
+-------------------+-----+-----+
only showing top 5 rows



In [52]:
df.sort(col('group')).show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  2.5429407553387087|    x| true|
|-0.01034994586005614|    x| true|
| -0.5617326120195298|    x| true|
| 0.06646960447774532|    x| true|
| -0.3528167116409259|    x| true|
+--------------------+-----+-----+
only showing top 5 rows



In [53]:
df.sort(col('group').desc()).show(5)

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|0.43231867971899207|    z|false|
| 1.1487581707355172|    z| true|
| -0.424131466150611|    z|false|
| 1.3990770458690214|    z|false|
|-1.6862345883268082|    z|false|
+-------------------+-----+-----+
only showing top 5 rows



In [54]:
df.sort(df.group, df.n).show(5)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.8305550322712313|    x|false|
| -0.5833067771337954|    x|false|
| -0.5617326120195298|    x| true|
|-0.40570448185281643|    x|false|
| -0.3528167116409259|    x| true|
+--------------------+-----+-----+
only showing top 5 rows



In [55]:
df.sort(df.abool, df.group, df.n).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.8305550322712313|    x|false|
| -0.5833067771337954|    x|false|
|-0.40570448185281643|    x|false|
|-0.00618632342440...|    x|false|
| -0.9401936269644735|    y|false|
|  0.1357210885660363|    y|false|
| 0.47037096676466916|    y|false|
| -1.6862345883268082|    z|false|
|  -0.424131466150611|    z|false|
| 0.43231867971899207|    z|false|
|  1.3990770458690214|    z|false|
| -0.5617326120195298|    x| true|
| -0.3528167116409259|    x| true|
|-0.12370316213339579|    x| true|
|-0.01034994586005614|    x| true|
| 0.06646960447774532|    x| true|
|  2.5429407553387087|    x| true|
| -1.5097845963470908|    y| true|
| -0.1140648455660212|    y| true|
|  1.1487581707355172|    z| true|
+--------------------+-----+-----+



In [56]:
df.sort(df.group, df.abool, df.n).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.8305550322712313|    x|false|
| -0.5833067771337954|    x|false|
|-0.40570448185281643|    x|false|
|-0.00618632342440...|    x|false|
| -0.5617326120195298|    x| true|
| -0.3528167116409259|    x| true|
|-0.12370316213339579|    x| true|
|-0.01034994586005614|    x| true|
| 0.06646960447774532|    x| true|
|  2.5429407553387087|    x| true|
| -0.9401936269644735|    y|false|
|  0.1357210885660363|    y|false|
| 0.47037096676466916|    y|false|
| -1.5097845963470908|    y| true|
| -0.1140648455660212|    y| true|
| -1.6862345883268082|    z|false|
|  -0.424131466150611|    z|false|
| 0.43231867971899207|    z|false|
|  1.3990770458690214|    z|false|
|  1.1487581707355172|    z| true|
+--------------------+-----+-----+



1. Aggregating

    1. What is the average `n` value for each group in the `group` column?
    1. What is the maximum `n` value for each group in the `group` column?
    1. What is the minimum `n` value by `abool`?
    1. What is the average `n` value for each unique combination of the `group`
       and `abool` column?

In [57]:
pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)

In [58]:
df.groupBy(df.group).agg(avg(df.n)).show()

+-----+--------------------+
|group|              avg(n)|
+-----+--------------------+
|    x| 0.33992568707345344|
|    z|0.057891308286099265|
|    y|-0.03014774155708...|
+-----+--------------------+



In [59]:
df.groupBy(df.group).agg(max(df.n)).show()

+-----+------------------+
|group|            max(n)|
+-----+------------------+
|    x|0.5349815199772688|
|    z|2.1852914632153926|
|    y| 1.135688982685717|
+-----+------------------+



In [60]:
df.groupBy(df.group,df.abool).agg(avg(df.n)).show()

+-----+-----+--------------------+
|group|abool|              avg(n)|
+-----+-----+--------------------+
|    z|false| 0.28103916312352434|
|    y|false| -0.8650616323073733|
|    y| true|  0.1785807311304849|
|    x|false|-0.00800202389283...|
|    x| true|  0.5138895425565977|
|    z| true|-0.05368261913261...|
+-----+-----+--------------------+



In [68]:
df.rollup(df.group).agg(avg(df.n)).show()

+-----+--------------------+
|group|              avg(n)|
+-----+--------------------+
| null| 0.07818670264340587|
|    z|0.057891308286099265|
|    y|-0.03014774155708...|
|    x| 0.33992568707345344|
+-----+--------------------+

