# Spark Mini Exercises

In [1]:
import pandas as pd
import numpy as np
import pyspark

In [2]:
np.random.seed(13)

pandas_dataframe = pd.DataFrame(
    {
        "n": np.random.randn(20),
        "group": np.random.choice(list("xyz"), 20),
        "abool": np.random.choice([True, False], 20),
    }
)

In [3]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

## Spark Dataframe Basics

1. Use the starter code above to create a pandas dataframe.
1. Convert the pandas dataframe to a Spark dataframe. From this point forward, do all of your work with the Spark dataframe, not the pandas dataframe.
1. Show the first 3 rows of the dataframe.
1. Show the first 7 rows of the dataframe.
1. View a summary of the data using .describe.
1. Use .select to create a new dataframe with just the n and abool columns. View the first 5 rows of this dataframe.
1. Use .select to create a new dataframe with just the group and abool columns. View the first 5 rows of this dataframe.
1. Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. Show the first 3 rows of this dataframe.
1. Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. Show the first 6 rows of this dataframe.

In [4]:
df = spark.createDataFrame(pandas_dataframe)

In [5]:
# show first 3 rows
df.show(3)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
+--------------------+-----+-----+
only showing top 3 rows



In [6]:
# show first 7 rows
df.show(7)

+--------------------+-----+-----+
|                   n|group|abool|
+--------------------+-----+-----+
|  -0.712390662050588|    z|false|
|   0.753766378659703|    x|false|
|-0.04450307833805...|    z|false|
| 0.45181233874578974|    y|false|
|  1.3451017084510097|    z|false|
|  0.5323378882945463|    y|false|
|  1.3501878997225267|    z|false|
+--------------------+-----+-----+
only showing top 7 rows



In [7]:
# view a summary of the data
df.describe().show()

+-------+------------------+-----+
|summary|                 n|group|
+-------+------------------+-----+
|  count|                20|   20|
|   mean|0.3664026449885217| null|
| stddev|0.8905322898155363| null|
|    min|-1.261605945319069|    x|
|    max|2.1503829673811126|    z|
+-------+------------------+-----+



In [11]:
# Use .select to create a new dataframe with just the n and abool columns. 
# View the first 5 rows of this dataframe.

df2 = df.select('n', 'abool')

df2.show(5)

+--------------------+-----+
|                   n|abool|
+--------------------+-----+
|  -0.712390662050588|false|
|   0.753766378659703|false|
|-0.04450307833805...|false|
| 0.45181233874578974|false|
|  1.3451017084510097|false|
+--------------------+-----+
only showing top 5 rows



In [14]:
# Use .select to create a new dataframe with just the group and abool columns. 
# View the first 5 rows of this dataframe.

df3 = df.select('group', 'abool')

df3.show(5)

+-----+-----+
|group|abool|
+-----+-----+
|    z|false|
|    x|false|
|    z|false|
|    y|false|
|    z|false|
+-----+-----+
only showing top 5 rows



In [18]:
# Use .select to create a new dataframe with the group column and the abool column renamed to a_boolean_value. 
# Show the first 3 rows of this dataframe.


df4 = df3.select('group', df3.abool.alias('a_boolean_value'))

df4.show(3)

+-----+---------------+
|group|a_boolean_value|
+-----+---------------+
|    z|          false|
|    x|          false|
|    z|          false|
+-----+---------------+
only showing top 3 rows



In [20]:
# Use .select to create a new dataframe with the group column and the n column renamed to a_numeric_value. 
# Show the first 6 rows of this dataframe.

df5 = df.select('group', df.n.alias('a_numeric_value'))
df5.show(6)

+-----+--------------------+
|group|     a_numeric_value|
+-----+--------------------+
|    z|  -0.712390662050588|
|    x|   0.753766378659703|
|    z|-0.04450307833805...|
|    y| 0.45181233874578974|
|    z|  1.3451017084510097|
|    y|  0.5323378882945463|
+-----+--------------------+
only showing top 6 rows



## Column Manipulation

1. Use the starter code above to re-create a Spark dataframe. Store the Spark dataframe in a variable named df.

1. Use .select to add 4 to the n column. Show the results.

1. Subtract 5 from the n column and view the results.

1. Multiply the n column by 2. View the results along with the original numbers.

1. Add a new column named n2 that is the n value multiplied by -1. Show the first 4 rows of your dataframe. You should see the original n value as well as n2.

1. Add a new column named n3 that is the n value squared. Show the first 5 rows of your dataframe. You should see n, n2, and n3.

1. What happens when you run the code below?
   - df.group + df.abool
   
1. What happens when you run the code below? What is the difference between this and the previous code sample?
   - df.select(df.group + df.abool)
1. Try adding various other columns together. What are the results of combining the different data types?



In [22]:
# add 4 to n column
df.select(df.n + 4).show(5)

+------------------+
|           (n + 4)|
+------------------+
|3.2876093379494122|
| 4.753766378659703|
|3.9554969216619464|
|  4.45181233874579|
|5.3451017084510095|
+------------------+
only showing top 5 rows



In [23]:
# subtract 5
df.select(df.n - 5).show(5)

+-------------------+
|            (n - 5)|
+-------------------+
| -5.712390662050588|
| -4.246233621340297|
| -5.044503078338053|
|  -4.54818766125421|
|-3.6548982915489905|
+-------------------+
only showing top 5 rows



In [24]:
# multiply n by 2 and show it next to the original values
df.select('n', df.n * 2).show(5)

+--------------------+--------------------+
|                   n|             (n * 2)|
+--------------------+--------------------+
|  -0.712390662050588|  -1.424781324101176|
|   0.753766378659703|   1.507532757319406|
|-0.04450307833805...|-0.08900615667610691|
| 0.45181233874578974|  0.9036246774915795|
|  1.3451017084510097|  2.6902034169020195|
+--------------------+--------------------+
only showing top 5 rows



In [30]:
# Add a new column named n2 that is the n value multiplied by -1. 
# Show the first 4 rows of your dataframe. 
# You should see the original n value as well as n2.

df_n= df.select('n', (df.n*-1).alias('n2'))
df_n.show(4)

+--------------------+--------------------+
|                   n|                  n2|
+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|
|   0.753766378659703|  -0.753766378659703|
|-0.04450307833805...|0.044503078338053455|
| 0.45181233874578974|-0.45181233874578974|
+--------------------+--------------------+
only showing top 4 rows



In [31]:
# Add a new column named n3 that is the n value squared. 
# Show the first 5 rows of your dataframe. You should see n, n2, and n3.

# reference n through df_n, the dataframe being selected from
df_n.select('*', (df_n.n ** 2).alias('n3')).show(5)

+--------------------+--------------------+--------------------+
|                   n|                  n2|                  n3|
+--------------------+--------------------+--------------------+
|  -0.712390662050588|   0.712390662050588|   0.507500455376875|
|   0.753766378659703|  -0.753766378659703|  0.5681637535977627|
|-0.04450307833805...|0.044503078338053455|0.001980523981562...|
| 0.45181233874578974|-0.45181233874578974| 0.20413438944294027|
|  1.3451017084510097| -1.3451017084510097|  1.8092986060778251|
+--------------------+--------------------+--------------------+
only showing top 5 rows



In [32]:
df.group + df.abool

Column<'(group + abool)'>

In [33]:
df.select(df.group + df.abool)

AnalysisException: cannot resolve '(CAST(`group` AS DOUBLE) + `abool`)' due to data type mismatch: differing types in '(CAST(`group` AS DOUBLE) + `abool`)' (double and boolean).;
'Project [(cast(group#1 as double) + abool#2) AS (group + abool)#407]
+- LogicalRDD [n#0, group#1, abool#2], false
