In [1]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np

np.random.seed(13)

pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})

1. Spark Dataframe Basics

    1. Use the starter code above to create a pandas dataframe.
    1. Convert the pandas dataframe to a spark dataframe. From this point
       forward, do all of your work with the spark dataframe, not the pandas
       dataframe.
    1. Show the first 3 rows of the dataframe.
    1. Show the first 7 rows of the dataframe.
    1. View a summary of the data using `.describe`.
    1. Use `.select` to create a new dataframe with just the `n` and `abool`
       columns. View the first 5 rows of this dataframe.
    1. Use `.select` to create a new dataframe with just the `group` and `abool`
       columns. View the first 5 rows of this dataframe.
    1. Use `.select` to create a new dataframe with the `group` column and the
       `abool` column renamed to `a_boolean_value`. Show the first 3 rows of
       this dataframe.
    1. Use `.select` to create a new dataframe with the `group` column and the
       `n` column renamed to `a_numeric_value`. Show the first 6 rows of this
       dataframe.
       

In [2]:
# convert to spark dataframe
import pyspark
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(pandas_dataframe)
df.head(2)

[Row(n=-0.712390662050588, group='z', abool=False),
 Row(n=0.753766378659703, group='x', abool=False)]

In [3]:
# summary statistics; .describe() returns a new DataFrame, so call .show() to print it
df.describe()

DataFrame[summary: string, n: string, group: string]

In [4]:
# .select operations
n_abool_df = df.select(df.n.alias('number'), df.abool.alias('a_boolean_value'))
n_abool_df.head(5)

[Row(number=-0.712390662050588, a_boolean_value=False),
 Row(number=0.753766378659703, a_boolean_value=False),
 Row(number=-0.044503078338053455, a_boolean_value=False),
 Row(number=0.45181233874578974, a_boolean_value=False),
 Row(number=1.3451017084510097, a_boolean_value=False)]

2. Column Manipulation

    1. Use the starter code above to re-create a spark dataframe. Store the
       spark dataframe in a variable named `df`.

    1. Use `.select` to add 4 to the `n` column. Show the results.

    1. Subtract 5 from the `n` column and view the results.

    1. Multiply the `n` column by 2. View the results along with the original
       numbers.

    1. Add a new column named `n2` that is the `n` value multiplied by -1. Show
       the first 4 rows of your dataframe. You should see the original `n` value
       as well as `n2`.

    1. Add a new column named `n3` that is the `n` value squared. Show the first 5
       rows of your dataframe. You should see `n`, `n2`, and `n3`.

    1. What happens when you run the code below?

        ```python
        df.group + df.abool
        ```

    1. What happens when you run the code below? What is the difference between
       this and the previous code sample?

        ```python
        df.select(df.group + df.abool)
        ```

    1. Try adding various other columns together. What are the results of
       combining the different data types?

In [5]:
# add, subtract, multiply
df.select('*', (df.n + 4).alias('add_4')).head(2)

[Row(n=-0.712390662050588, group='z', abool=False, add_4=3.2876093379494122),
 Row(n=0.753766378659703, group='x', abool=False, add_4=4.753766378659703)]

In [6]:
df.group + df.abool

Column<'(group + abool)'>

3. Type casting

    1. Use the starter code above to re-create a spark dataframe.

    1. Use `.printSchema` to view the datatypes in your dataframe.

    1. Use `.dtypes` to view the datatypes in your dataframe.

    1. What is the difference between the two code samples below?

        ```python
        df.abool.cast('int')
        ```

        ```python
        df.select(df.abool.cast('int')).show()
        ```

    1. Use `.select` and `.cast` to convert the `abool` column to an integer
       type. View the results.
    1. Convert the `group` column to an integer data type and view the results.
       What happens?
    1. Convert the `n` column to an integer data type and view the results. What
       happens?
    1. Convert the `abool` column to a string data type and view the results.
       What happens?

In [7]:
df.select(df.abool.cast('string')).head(2)

[Row(abool='false'), Row(abool='false')]

4. Built-in Functions

    1. Use the starter code above to re-create a spark dataframe.
    1. Import the necessary functions from `pyspark.sql.functions`
    1. Find the highest `n` value.
    1. Find the lowest `n` value.
    1. Find the average `n` value.
    1. Use `concat` to change the `group` column to say, e.g. "Group: x" or
       "Group: y"
    1. Use `concat` to combine the `n` and `group` columns to produce results
       that look like this: "x: -1.432" or "z: 2.352"

In [8]:
# Note: this import shadows Python's built-in max and min for the rest of the session
from pyspark.sql.functions import max, min, mean, concat, lit
df.select(max(df.n)).show()

+------------------+
|            max(n)|
+------------------+
|2.1503829673811126|
+------------------+



In [9]:
df.select(concat(df.group, lit(': '), df.n).alias('group_vals')).head(3)

[Row(group_vals='z: -0.712390662050588'),
 Row(group_vals='x: 0.753766378659703'),
 Row(group_vals='z: -0.044503078338053455')]

5. When / Otherwise

    1. Use the starter code above to re-create a spark dataframe.
    1. Use `when` and `.otherwise` to create a column that contains the text "It
       is true" when `abool` is true and "It is false" when `abool` is false.
    1. Create a column that contains 0 if `n` is less than 0, otherwise, the
       original `n` value.

In [10]:
# B
from pyspark.sql.functions import when
df.select('abool', 
          when(df.abool, 'It is true').alias('truths'), 
          when(~ df.abool, 'It is false').alias('falsths')).show()

+-----+----------+-----------+
|abool|    truths|    falsths|
+-----+----------+-----------+
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
| true|It is true|       null|
| true|It is true|       null|
|false|      null|It is false|
|false|      null|It is false|
| true|It is true|       null|
| true|It is true|       null|
|false|      null|It is false|
|false|      null|It is false|
|false|      null|It is false|
| true|It is true|       null|
|false|      null|It is false|
| true|It is true|       null|
+-----+----------+-----------+



In [11]:
# C
df.select(when(df.n < 0, 0).otherwise(df.n).alias('0 or greater')).head(3)

[Row(0 or greater=0.0),
 Row(0 or greater=0.753766378659703),
 Row(0 or greater=0.0)]

6. Filter / Where

    1. Use the starter code above to re-create a spark dataframe.
    1. Use `.filter` or `.where` to select just the rows where the group is `y`
       and view the results.
    1. Select just the rows where the `abool` column is false and view the
       results.
    1. Find the rows where the `group` column is *not* `y`.
    1. Find the rows where `n` is positive.
    1. Find the rows where `abool` is true and the `group` column is `z`.
    1. Find the rows where `abool` is true or the `group` column is `z`.
    1. Find the rows where `abool` is false and `n` is less than 1.
    1. Find the rows where `abool` is false or `n` is less than 1.

In [12]:
df.where((df.group != 'y') & (df.abool)).head(3)

[Row(n=1.4786857374358966, group='z', abool=True),
 Row(n=-0.026771649986440726, group='x', abool=True)]

7. Sorting

    1. Use the starter code above to re-create a spark dataframe.
    1. Sort by the `n` value.
    1. Sort by the `group` value, both ascending and descending.
    1. Sort by the group value first, then, within each group, sort by `n`
       value.
    1. Sort by `abool`, `group`, and `n`. Does it matter in what order you
       specify the columns when sorting?

In [13]:
df.sort(df.group.asc(), df.abool.desc()).head(4)

[Row(n=-0.026771649986440726, group='x', abool=True),
 Row(n=0.753766378659703, group='x', abool=False),
 Row(n=-0.7889890249515489, group='x', abool=False),
 Row(n=0.6062886568962988, group='x', abool=False)]

8. Spark SQL

    1. Use the starter code above to re-create a spark dataframe.
    1. Turn your dataframe into a table that can be queried with spark SQL. Name
       the table `my_df`. Answer the rest of the questions in this section with
       a spark sql query (`spark.sql`) against `my_df`. After each step, view
       the first 7 records from the dataframe.
    1. Write a query that shows all of the columns from your dataframe.
    1. Write a query that shows just the `n` and `abool` columns from the
       dataframe.
    1. Write a query that shows just the `n` and `group` columns. Rename the
       `group` column to `g`.
    1. Write a query that selects `n`, and creates two new columns: `n2`, the
       original `n` values halved, and `n3`, the original `n` values minus 1.
    1. What happens if you make a SQL syntax error in your query?

In [14]:
# initial creation
df.createOrReplaceTempView('df')
spark.sql(''' SELECT * FROM df ''').head(3)

[Row(n=-0.712390662050588, group='z', abool=False),
 Row(n=0.753766378659703, group='x', abool=False),
 Row(n=-0.044503078338053455, group='z', abool=False)]

In [15]:
# SQL queries
spark.sql(""" select n, (n / 2) as n2, (n - 1) as n3 from df """).show(7)

+--------------------+--------------------+--------------------+
|                   n|                  n2|                  n3|
+--------------------+--------------------+--------------------+
|  -0.712390662050588|  -0.356195331025294|  -1.712390662050588|
|   0.753766378659703|  0.3768831893298515|-0.24623362134029703|
|-0.04450307833805...|-0.02225153916902...| -1.0445030783380536|
| 0.45181233874578974| 0.22590616937289487| -0.5481876612542103|
|  1.3451017084510097|  0.6725508542255049| 0.34510170845100974|
|  0.5323378882945463| 0.26616894414727316| -0.4676621117054537|
|  1.3501878997225267|  0.6750939498612634| 0.35018789972252673|
+--------------------+--------------------+--------------------+
only showing top 7 rows



9. Aggregating

    1. What is the average `n` value for each group in the `group` column?
    1. What is the maximum `n` value for each group in the `group` column?
    1. What is the minimum `n` value by `abool`?
    1. What is the average `n` value for each unique combination of the `group`
       and `abool` columns?

In [16]:
df.groupBy('group').agg(mean(df.n)).show()

+-----+------------------+
|group|            avg(n)|
+-----+------------------+
|    x|0.2871427762539448|
|    z| 0.590730814237962|
|    y| 0.257601419602374|
+-----+------------------+



In [17]:
df.groupBy('group','abool').agg(mean(df.n)).show()

+-----+-----+--------------------+
|group|abool|              avg(n)|
+-----+-----+--------------------+
|    z|false| 0.41313982959837514|
|    x|false|  0.3499256615020219|
|    y|false| 0.15907124664523611|
|    y| true| 0.35613159255951177|
|    z| true|  1.4786857374358966|
|    x| true|-0.02677164998644...|
+-----+-----+--------------------+



Revisit the exercises for the [pandas dataframes lesson][1] and [the advanced
dataframes lesson][2]. Complete the exercises, but convert the pandas dataframes
to spark dataframes first in order to practice using the spark api.

[1]: https://ds.codeup.com/python/dataframes/
[2]: https://ds.codeup.com/python/advanced-dataframes/