# when() example

- The `when()` clause lets you conditionally modify a DataFrame based on its content. Here you will modify the `voter_df` DataFrame to add a random number for every voting member whose title is `"Councilmember"`.

- The `voter_df` DataFrame is defined and available to you. The `pyspark.sql.functions` library is available as `F`. You can use `F.rand()` to generate the random value.
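
To make the expected behaviour concrete, here is a minimal plain-Python sketch of what a `when()` clause with no fallback value does to each row. It is a model of the semantics only, not PySpark code: the helper `when_councilmember` and the `rows` list are hypothetical names introduced for illustration, the sample rows are copied from the notebook output, and `random.Random` stands in for `F.rand()` (both produce floats in the range [0.0, 1.0)).

```python
import random

# A few rows copied from the notebook output, using the CSV's column names.
rows = [
    {"DATE": "02/08/2017", "TITLE": "Councilmember", "VOTER_NAME": "Jennifer S. Gates"},
    {"DATE": "02/08/2017", "TITLE": "Mayor", "VOTER_NAME": "Michael S. Rawlings"},
    {"DATE": "02/08/2017", "TITLE": "Councilmember", "VOTER_NAME": "Adam Medrano"},
]


def when_councilmember(row, rng):
    """Hypothetical stand-in for F.when(F.col("TITLE") == "Councilmember", F.rand())."""
    if row["TITLE"] == "Councilmember":
        return rng.random()  # random float in [0.0, 1.0), like F.rand()
    return None  # no fallback value was given; Spark displays this as null


rng = random.Random(42)  # seeded only so the sketch is repeatable
result = [dict(row, random_val=when_councilmember(row, rng)) for row in rows]

for row in result:
    print(row["TITLE"], row["VOTER_NAME"], row["random_val"])
```

Rows that fail the condition keep all their original columns and simply receive an empty `random_val`, which is why a `Mayor` row is expected to show `null` in the new column.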

## Instructions

- Add a column named `random_val` to `voter_df` that holds the result of `F.rand()` for any voter with the title `Councilmember`.
- Show some of the DataFrame rows, noting whether the `.when()` clause worked.

In [1]:
# Initialization
import os
import sys

os.environ["SPARK_HOME"] = "/home/talentum/spark"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + "/python/lib"
# In the two lines below, use /usr/bin/python2.7 if you want to use Python 2
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3.6" 
os.environ["PYSPARK_DRIVER_PYTHON"] = "/usr/bin/python3"
sys.path.insert(0, os.environ["PYLIB"] +"/py4j-0.10.7-src.zip")
sys.path.insert(0, os.environ["PYLIB"] +"/pyspark.zip")

# NOTE: List whichever packages you want to load here.
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell' 
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.3 pyspark-shell'
# os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0,org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell'

In [2]:
# Entry point for Spark 2.x
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().getOrCreate()

# On yarn:
# spark = SparkSession.builder.appName("Spark SQL basic example").enableHiveSupport().master("yarn").getOrCreate()
# specify .master("yarn")

sc = spark.sparkContext

In [5]:
import pyspark.sql.functions as F

# Load the CSV file
voter_df = spark.read.format('csv').options(Header=True).load('file:///home/talentum/test-jupyter/P3/M2/sm2/2_ConditionalDataFramecolumnoperations/Dataset/DallasCouncilVoters.csv.gz')
voter_df.printSchema()  # printSchema() prints the schema itself and returns None
# Add a random_val column, populated only for voters with the title "Councilmember"
voter_df = voter_df.withColumn('random_val',
                               F.when(F.col("TITLE") == "Councilmember", F.rand()))

# Show some of the DataFrame rows, noting whether the when clause worked
voter_df.show()

root
 |-- DATE: string (nullable = true)
 |-- TITLE: string (nullable = true)
 |-- VOTER_NAME: string (nullable = true)

+----------+-------------+-------------------+-------------------+
|      DATE|        TITLE|         VOTER_NAME|         random_val|
+----------+-------------+-------------------+-------------------+
|02/08/2017|Councilmember|  Jennifer S. Gates| 0.6070645070880518|
|02/08/2017|Councilmember| Philip T. Kingston| 0.8884028153454652|
|02/08/2017|        Mayor|Michael S. Rawlings|               null|
|02/08/2017|Councilmember|       Adam Medrano| 0.4948274576791183|
|02/08/2017|Councilmember|       Casey Thomas| 0.1736714404291938|
|02/08/2017|Councilmember|Carolyn King Arnold|  0.636957357539489|
|02/08/2017|Councilmember|       Scott Griggs| 0.5424620310937894|
|02/08/2017|Councilmember|   B. Adam  McGough|0.39948499070448895|
|02/08/2017|Councilmember|       Lee Kleinman|0.09049447798091215|
|02/08/2017|Councilmember|      Sandy Greyson|0.33236458809841996|
|02