In [1]:
import pandas as pd
import numpy as np

np.random.seed(13)

pandas_dataframe = pd.DataFrame({
    "n": np.random.randn(20),
    "group": np.random.choice(list("xyz"), 20),
    "abool": np.random.choice([True, False], 20),
})

### 1. Spark Dataframe Basics

- Use the starter code above to create a pandas dataframe.
- Convert the pandas dataframe to a spark dataframe. From this point forward, do all of your work with the spark dataframe, not the pandas dataframe.
- Show the first 3 rows of the dataframe.
- Show the first 7 rows of the dataframe.
- View a summary of the data using .describe.
- Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.
- Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.
- Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.
- Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [2]:
import pyspark

# start (or reuse) a Spark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# convert the pandas dataframe to a spark dataframe
df = spark.createDataFrame(pandas_dataframe)

In [3]:
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



In [4]:
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



In [5]:
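# Neither expression prints the summary: df.describe is the bound method,
# and df.describe() returns a new dataframe that still needs .show()
df.describe, \
df.describe()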

(<bound method DataFrame.describe of DataFrame[n: double, group: string, abool: boolean]>,
 DataFrame[summary: string, n: string, group: string])

In [6]:
# select the columns
df.select('n', 'abool').show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



In [7]:
# select the columns
df.select('group', 'abool').show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



In [8]:
# keep a reference to the abool column
a_boolean_value = df.abool

# select it under the new name 'a_boolean_value'
df.select('group', a_boolean_value.alias('a_boolean_value')).show(5)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
|    y|          false|
|    z|          false|
+-----+---------------+
only showing top 5 rows



In [9]:
# keep a reference to the n column
a_numeric_value = df.n

# select it under the new name 'a_numeric_value'
df.select('group', a_numeric_value.alias('a_numeric_value')).show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



---
### 2. Column Manipulation

- Use the starter code above to re-create a spark dataframe. Store the spark dataframe in a variable named df.

- Use .select to add 4 to the n column. Show the results.

- Subtract 5 from the n column and view the results.

- Multiply the n column by 2. View the results along with the original numbers.

- Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

- Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

- What happens when you run the code below?

> df.group + df.abool
- What happens when you run the code below? What is the difference between this and the previous code sample?

> df.select(df.group + df.abool)
- Try adding various other columns together. What are the results of combining the different data types?

In [10]:
df.show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
+--------------------+-----+-----+
only showing top 10 rows



In [11]:
df.select(df.n + 4).show(10)

+------------------+
|           (n + 4)|
+------------------+
|3.2876093379494122|
| 4.753766378659703|
|3.9554969216619464|
|  4.45181233874579|
|5.3451017084510095|
| 4.532337888294546|
| 5.350187899722527|
|  4.86121137416932|
| 5.478685737435897|
| 2.954622869461466|
+------------------+
only showing top 10 rows



In [12]:
df.select(df.n - 5).show(10)

+-------------------+
|            (n - 5)|
+-------------------+
| -5.712390662050588|
| -4.246233621340297|
| -5.044503078338053|
|  -4.54818766125421|
|-3.6548982915489905|
| -4.467662111705454|
|-3.6498121002774733|
|  -4.13878862583068|
| -3.521314262564103|
| -6.045377130538534|
+-------------------+
only showing top 10 rows



In [13]:
df.select(
    'n',
    df.n * 2
).show()

+--------------------+--------------------+
|                   n|             (n * 2)|
+--------------------+--------------------+
|  -0.712390662050588|  -1.424781324101176|
|   0.753766378659703|   1.507532757319406|
|-0.04450307833805...|-0.08900615667610691|
| 0.45181233874578974|  0.9036246774915795|
|  1.3451017084510097|  2.6902034169020195|
|  0.5323378882945463|  1.0646757765890926|
|  1.3501878997225267|  2.7003757994450535|
|  0.8612113741693206|  1.7224227483386412|
|  1.4786857374358966|   2.957371474871793|
| -1.0453771305385342| -2.0907542610770684|
| -0.7889890249515489| -1.5779780499030978|
|  -1.261605945319069|  -2.523211890638138|
|  0.5628467852810314|  1.1256935705620628|
|-0.24332625188556253|-0.48665250377112507|
|  0.9137407048596775|   1.827481409719355|
| 0.31735092273633597|  0.6347018454726719|
| 0.12730328020698067| 0.25460656041396135|
|  2.1503829673811126|   4.300765934762225|
|  0.6062886568962988|  1.2125773137925977|
|-0.02677164998644...|-0.0535432

Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

In [18]:
n2 = df.n * (-1)
df.select(
    'n',
    n2.alias('n2')
).show(4)

+--------------------+--------------------+
|                   n|                  n2|
+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|
|   0.753766378659703|  -0.753766378659703|
|-0.04450307833805...|0.044503078338053455|
| 0.45181233874578974|-0.45181233874578974|
+--------------------+--------------------+
only showing top 4 rows



Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

In [17]:
n3 = df.n**2
df.select(
    'n',
    n2.alias('n2'),
    n3.alias('n3')
).show(5)

+--------------------+--------------------+--------------------+
|                   n|                  n2|                  n3|
+--------------------+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|-0.45181233874578974| 0.20413438944294027|
|  1.3451017084510097| -1.3451017084510097|  1.8092986060778251|
+--------------------+--------------------+--------------------+
only showing top 5 rows



What happens when you run the code below?

> df.group + df.abool

In [19]:
df.group + df.abool

# No error and no data: this only builds an unevaluated Column expression.
# Spark does not check the column types until the expression is used in a dataframe.

Column<'(group + abool)'>

What happens when you run the code below? What is the difference between this and the previous code sample?

> df.select(df.group + df.abool)

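`.select` has to resolve the expression against the dataframe's schema, so Spark checks the column types at this point. A string column cannot be added to a boolean column, so this call should raise an `AnalysisException` (data type mismatch) instead of returning a dataframe. The previous sample only built a `Column` expression and never triggered that check.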

Try adding various other columns together. What are the results of combining the different data types?

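On Spark 3 with default (non-ANSI) settings, the results should be: `n + n` adds normally; `n + group` casts the string to a double, so non-numeric values such as x, y, and z become null; any combination that involves `abool` fails with a data type mismatch, because booleans are not implicitly cast to numbers.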

---
### 3. Type casting

Use the starter code above to re-create a spark dataframe. \
Use .printSchema to view the datatypes in your dataframe.

In [28]:
df.printSchema()

root
 |-- n: double (nullable = true)
 |-- group: string (nullable = true)
 |-- abool: boolean (nullable = true)



Use .dtypes to view the datatypes in your dataframe.

In [30]:
df.dtypes

[('n', 'double'), ('group', 'string'), ('abool', 'boolean')]

What is the difference between the two code samples below?

>df.abool.cast('int') \
>df.select(df.abool.cast('int')).show()

In [33]:
# This only builds a Column expression that describes the cast to integer;
# nothing is computed and df itself is not changed
df.abool.cast('int')

Column<'CAST(abool AS INT)'>

In [36]:
# This uses the same cast expression inside .select, so Spark evaluates it and shows the result
df.select(df.abool, df.abool.cast('int')).show(5)

+-----+-----+
|abool|abool|
+-----+-----+
|false|    0|
|false|    0|
|false|    0|
|false|    0|
|false|    0|
+-----+-----+
only showing top 5 rows



Use .select and .cast to convert the abool column to an integer type. View the results.

In [37]:
df.select(df.abool, df.abool.cast('int')).show(5)

+-----+-----+
|abool|abool|
+-----+-----+
|false|    0|
|false|    0|
|false|    0|
|false|    0|
|false|    0|
+-----+-----+
only showing top 5 rows



Convert the group column to a integer data type and view the results. What happens?

In [39]:
df.select(df.group, df.group.cast('int')).show(5)

# The group values ('x', 'y', 'z') cannot be parsed as integers,
# so the cast produces null for every row

+-----+-----+
|group|group|
+-----+-----+
|    z| null|
|    x| null|
|    z| null|
|    y| null|
|    z| null|
+-----+-----+
only showing top 5 rows



Convert the n column to a integer data type and view the results. What happens?

In [46]:
df.select(df.n, df.n.cast('int')).show(10)

# the fractional part is dropped (truncation toward zero), not rounded:
# 0.7537... becomes 0 and -1.0453... becomes -1

+--------------------+---+
|                   n|  n|
+--------------------+---+
|  -0.712390662050588|  0|
|   0.753766378659703|  0|
|-0.04450307833805...|  0|
| 0.45181233874578974|  0|
|  1.3451017084510097|  1|
|  0.5323378882945463|  0|
|  1.3501878997225267|  1|
|  0.8612113741693206|  0|
|  1.4786857374358966|  1|
| -1.0453771305385342| -1|
+--------------------+---+
only showing top 10 rows



Convert the abool column to a string data type and view the results. What happens?

In [51]:
df.select(df.abool, df.abool.cast('string')).show(10)

# The true/false values are simply converted to a string value

+-----+-----+
|abool|abool|
+-----+-----+
|false|false|
|false|false|
|false|false|
|false|false|
|false|false|
|false|false|
|false|false|
|false|false|
| true| true|
| true| true|
+-----+-----+
only showing top 10 rows



---
### 4. Built-in Functions

Use the starter code above to re-create a spark dataframe. \
Import the necessary functions from pyspark.sql.functions

In [52]:
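# Note: importing min and max by name shadows Python's built-in min and max
# for the rest of the session
from pyspark.sql.functions import min, max, mean, lit, concat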

Find the highest n value.

In [56]:
df.select(max('n')).show()

+------------------+
|            max(n)|
+------------------+
|2.1503829673811126|
+------------------+



Find the lowest n value.

In [57]:
df.select(min('n')).show()

+------------------+
|            min(n)|
+------------------+
|-1.261605945319069|
+------------------+



Find the average n value.

In [58]:
df.select(mean('n')).show()

+-------------------+
|             avg(n)|
+-------------------+
|0.36640264498852165|
+-------------------+



Use concat to change the group column to say, e.g. "Group: x" or "Group: y"

In [66]:
df.select(
    'group',
    concat(lit('Group: '), 'group'),
).show()

+-----+----------------------+
|group|concat(Group: , group)|
+-----+----------------------+
|    z|              Group: z|
|    x|              Group: x|
|    z|              Group: z|
|    y|              Group: y|
|    z|              Group: z|
|    y|              Group: y|
|    z|              Group: z|
|    x|              Group: x|
|    z|              Group: z|
|    y|              Group: y|
|    x|              Group: x|
|    y|              Group: y|
|    y|              Group: y|
|    y|              Group: y|
|    y|              Group: y|
|    x|              Group: x|
|    z|              Group: z|
|    y|              Group: y|
|    x|              Group: x|
|    x|              Group: x|
+-----+----------------------+



Use concat to combine the n and group columns to produce results that look like this: "x: -1.432" or "z: 2.352"

In [73]:
df.select(
    'group',
    'n',
    concat('group', lit(': '), 'n')
).show()

+-----+--------------------+--------------------+
|group|                   n|concat(group, : , n)|
+-----+--------------------+--------------------+
|    z|  -0.712390662050588|z: -0.71239066205...|
|    x|   0.753766378659703|x: 0.753766378659703|
|    z|-0.04450307833805...|z: -0.04450307833...|
|    y| 0.45181233874578974|y: 0.451812338745...|
|    z|  1.3451017084510097|z: 1.345101708451...|
|    y|  0.5323378882945463|y: 0.532337888294...|
|    z|  1.3501878997225267|z: 1.350187899722...|
|    x|  0.8612113741693206|x: 0.861211374169...|
|    z|  1.4786857374358966|z: 1.478685737435...|
|    y| -1.0453771305385342|y: -1.04537713053...|
|    x| -0.7889890249515489|x: -0.78898902495...|
|    y|  -1.261605945319069|y: -1.26160594531...|
|    y|  0.5628467852810314|y: 0.562846785281...|
|    y|-0.24332625188556253|y: -0.24332625188...|
|    y|  0.9137407048596775|y: 0.913740704859...|
|    x| 0.31735092273633597|x: 0.317350922736...|
|    z| 0.12730328020698067|z: 0.127303280206...|


---
### 5. When / Otherwise

Use the starter code above to re-create a spark dataframe. \
Use when and .otherwise to create a column that contains the text "It is true" when abool is true and "It is false" when abool is false.

In [79]:
from pyspark.sql.functions import when

In [83]:
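df.select(
    'abool',
    # comparing to the string 'true' relies on an implicit cast;
    # abool is already boolean, so when(df.abool, ...) would also work
    when(df.abool == 'true', 'It is true').otherwise('It is false')
).show(10)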

+-----+-------------------------------------------------------------+
|abool|CASE WHEN (abool = true) THEN It is true ELSE It is false END|
+-----+-------------------------------------------------------------+
|false|                                                  It is false|
|false|                                                  It is false|
|false|                                                  It is false|
|false|                                                  It is false|
|false|                                                  It is false|
|false|                                                  It is false|
|false|                                                  It is false|
|false|                                                  It is false|
| true|                                                   It is true|
| true|                                                   It is true|
+-----+-------------------------------------------------------------+
only showing top 10 rows

Create a column that contains 0 if n is less than 0, otherwise, the original n value.

In [85]:
df.select(
    'n',
    when(df.n < 0, 0).otherwise(df.n)
).show(10)

+--------------------+-----------------------------------+
|                   n|CASE WHEN (n < 0) THEN 0 ELSE n END|
+--------------------+-----------------------------------+
|  -0.712390662050588|                                0.0|
|   0.753766378659703|                  0.753766378659703|
|-0.04450307833805...|                                0.0|
| 0.45181233874578974|                0.45181233874578974|
|  1.3451017084510097|                 1.3451017084510097|
|  0.5323378882945463|                 0.5323378882945463|
|  1.3501878997225267|                 1.3501878997225267|
|  0.8612113741693206|                 0.8612113741693206|
|  1.4786857374358966|                 1.4786857374358966|
| -1.0453771305385342|                                0.0|
+--------------------+-----------------------------------+
only showing top 10 rows



---
### 6. Filter / Where

Use the starter code above to re-create a spark dataframe. \
Use .filter or .where to select just the rows where the group is y and view the results.

In [93]:
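# No import is needed here: .filter and .where are dataframe methods.
# (pyspark.sql.functions.filter is a different function that works on array
# columns, and importing it by name shadows Python's built-in filter.)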

In [94]:
df.where(df.group=='y').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
| -1.0453771305385342|    y| true|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
|  2.1503829673811126|    y| true|
+--------------------+-----+-----+



In [99]:
df.filter(df.group=='y').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
| -1.0453771305385342|    y| true|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
|  2.1503829673811126|    y| true|
+--------------------+-----+-----+



Select just the rows where the abool column is false and view the results.

In [98]:
df.where(df.abool=='false').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
+--------------------+-----+-----+



In [100]:
df.filter(df.abool=='false').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
+--------------------+-----+-----+



Find the rows where the group column is not y.

In [104]:
df.where(df.group!='y').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -0.7889890249515489|    x|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



Find the rows where n is positive.

In [105]:
df.where(df.n > 0).show()

+-------------------+-----+-----+
|                  n|group|abool|
+-------------------+-----+-----+
|  0.753766378659703|    x|false|
|0.45181233874578974|    y|false|
| 1.3451017084510097|    z|false|
| 0.5323378882945463|    y|false|
| 1.3501878997225267|    z|false|
| 0.8612113741693206|    x|false|
| 1.4786857374358966|    z| true|
| 0.5628467852810314|    y| true|
| 0.9137407048596775|    y|false|
|0.31735092273633597|    x|false|
|0.12730328020698067|    z|false|
| 2.1503829673811126|    y| true|
| 0.6062886568962988|    x|false|
+-------------------+-----+-----+



Find the rows where abool is true and the group column is z.

In [118]:
df.filter(df.abool & (df.group == 'z')).show()

+------------------+-----+-----+
|                 n|group|abool|
+------------------+-----+-----+
|1.4786857374358966|    z| true|
+------------------+-----+-----+



Find the rows where abool is true or the group column is z.

In [119]:
df.filter(df.abool | (df.group == 'z')).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|-0.04450307833805...|    z|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
| 0.12730328020698067|    z|false|
|  2.1503829673811126|    y| true|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



Find the rows where abool is false and n is less than 1.

In [120]:
df.filter(~df.abool & (df.n < 1)).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.8612113741693206|    x|false|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
+--------------------+-----+-----+



Find the rows where abool is false or n is less than 1.

In [121]:
df.filter(~df.abool | (df.n < 1)).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



In [125]:
# find where group = z or n is less than 1
df.filter((df.group == 'z') | (df.n < 1)).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
|  0.8612113741693206|    x|false|
|  1.4786857374358966|    z| true|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -1.261605945319069|    y|false|
|  0.5628467852810314|    y| true|
|-0.24332625188556253|    y| true|
|  0.9137407048596775|    y|false|
| 0.31735092273633597|    x|false|
| 0.12730328020698067|    z|false|
|  0.6062886568962988|    x|false|
|-0.02677164998644...|    x| true|
+--------------------+-----+-----+



---
### 7. Sorting

Use the starter code above to re-create a spark dataframe. \
Sort by the n value.

In [126]:
df.sort(df.n).show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -1.261605945319069|    y|false|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -0.712390662050588|    z|false|
|-0.24332625188556253|    y| true|
|-0.04450307833805...|    z|false|
|-0.02677164998644...|    x| true|
| 0.12730328020698067|    z|false|
| 0.31735092273633597|    x|false|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.5628467852810314|    y| true|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
|  0.8612113741693206|    x|false|
|  0.9137407048596775|    y|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  1.4786857374358966|    z| true|
|  2.1503829673811126|    y| true|
+--------------------+-----+-----+



Sort by the group value, both ascending and descending.

In [128]:
from pyspark.sql.functions import asc, desc

In [130]:
df.orderBy(desc('group')).show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| 0.12730328020698067|    z|false|
|  1.3501878997225267|    z|false|
|  1.3451017084510097|    z|false|
|-0.04450307833805...|    z|false|
|  -0.712390662050588|    z|false|
|  1.4786857374358966|    z| true|
|  0.9137407048596775|    y|false|
|  -1.261605945319069|    y|false|
|  2.1503829673811126|    y| true|
|-0.24332625188556253|    y| true|
+--------------------+-----+-----+
only showing top 10 rows



In [131]:
df.orderBy(asc('group')).show(10)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  0.8612113741693206|    x|false|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
| -0.7889890249515489|    x|false|
|-0.02677164998644...|    x| true|
| 0.31735092273633597|    x|false|
| -1.0453771305385342|    y| true|
|  0.9137407048596775|    y|false|
| 0.45181233874578974|    y|false|
|  -1.261605945319069|    y|false|
+--------------------+-----+-----+
only showing top 10 rows



Sort by the group value first, then, within each group, sort by n value.

In [133]:
df.sort('group', 'n').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.7889890249515489|    x|false|
|-0.02677164998644...|    x| true|
| 0.31735092273633597|    x|false|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
|  0.8612113741693206|    x|false|
|  -1.261605945319069|    y|false|
| -1.0453771305385342|    y| true|
|-0.24332625188556253|    y| true|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.5628467852810314|    y| true|
|  0.9137407048596775|    y|false|
|  2.1503829673811126|    y| true|
|  -0.712390662050588|    z|false|
|-0.04450307833805...|    z|false|
| 0.12730328020698067|    z|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  1.4786857374358966|    z| true|
+--------------------+-----+-----+



Sort by abool, group, and n. Does it matter in what order you specify the columns when sorting?

In [135]:
df.sort('abool', 'group', 'n').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
| -0.7889890249515489|    x|false|
| 0.31735092273633597|    x|false|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
|  0.8612113741693206|    x|false|
|  -1.261605945319069|    y|false|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.9137407048596775|    y|false|
|  -0.712390662050588|    z|false|
|-0.04450307833805...|    z|false|
| 0.12730328020698067|    z|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|-0.02677164998644...|    x| true|
| -1.0453771305385342|    y| true|
|-0.24332625188556253|    y| true|
|  0.5628467852810314|    y| true|
|  2.1503829673811126|    y| true|
|  1.4786857374358966|    z| true|
+--------------------+-----+-----+



In [137]:
# the order matters: the first column is the primary sort key, and each
# later column only breaks ties among rows that are equal on the earlier ones
df.sort('n', 'group', 'abool').show()

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -1.261605945319069|    y|false|
| -1.0453771305385342|    y| true|
| -0.7889890249515489|    x|false|
|  -0.712390662050588|    z|false|
|-0.24332625188556253|    y| true|
|-0.04450307833805...|    z|false|
|-0.02677164998644...|    x| true|
| 0.12730328020698067|    z|false|
| 0.31735092273633597|    x|false|
| 0.45181233874578974|    y|false|
|  0.5323378882945463|    y|false|
|  0.5628467852810314|    y| true|
|  0.6062886568962988|    x|false|
|   0.753766378659703|    x|false|
|  0.8612113741693206|    x|false|
|  0.9137407048596775|    y|false|
|  1.3451017084510097|    z|false|
|  1.3501878997225267|    z|false|
|  1.4786857374358966|    z| true|
|  2.1503829673811126|    y| true|
+--------------------+-----+-----+



---
### 8. Spark SQL

Use the starter code above to re-create a spark dataframe. \
Turn your dataframe into a table that can be queried with spark SQL. Name the table my_df. Answer the rest of the questions in this section with a spark sql query (spark.sql) against my_df. After each step, view the first 7 records from the dataframe.

- Write a query that shows all of the columns from your dataframe.
- Write a query that shows just the n and abool columns from the dataframe.
- Write a query that shows just the n and group columns. Rename the group column to g.
- Write a query that selects n, and creates two new columns: n2, the original n values halved, and n3: the original n values minus 1.
- What happens if you make a SQL syntax error in your query?

---
### 9. Aggregating

- What is the average n value for each group in the group column?
- What is the maximum n value for each group in the group column?
- What is the minimum n value by abool?
- What is the average n value for each unique combination of the group and abool column?