# Acquisition
This exercise uses the case.csv and dept.csv files from the San Antonio 311 call dataset.

1. Read the files into the Spark environment (df_case, df_dept).

In [1]:
import pyspark
import pandas as pd
from pyspark.sql import SparkSession
import pyspark.sql.types as T

In [2]:
spark = SparkSession.builder.master("local").appName("read").\
    enableHiveSupport().\
    getOrCreate()

In [3]:
df = spark.read.format("csv").\
    option("sep", ",").\
    option("header", True).\
    option("inferSchema", True).\
    load("sa311/source.csv")

df.printSchema()

root
 |-- source_id: string (nullable = true)
 |-- source_username: string (nullable = true)



In [4]:
df.show()

+---------+--------------------+
|source_id|     source_username|
+---------+--------------------+
|   100137|    Merlene Blodgett|
|   103582|         Carmen Cura|
|   106463|     Richard Sanchez|
|   119403|      Betty De Hoyos|
|   119555|      Socorro Quiara|
|   119868| Michelle San Miguel|
|   120752|      Eva T. Kleiber|
|   124405|           Lori Lara|
|   132408|       Leonard Silva|
|   135723|        Amy Cardenas|
|   136202|    Michelle Urrutia|
|   136979|      Leticia Garcia|
|   137943|    Pamela K. Baccus|
|   138605|        Marisa Ozuna|
|   138650|      Kimberly Green|
|   138650|Kimberly Green-Woods|
|   138793| Guadalupe Rodriguez|
|   138810|       Tawona Martin|
|   139342|     Jessica Mendoza|
|   139344|        Isis Mendoza|
+---------+--------------------+
only showing top 20 rows



In [5]:
df_case = spark.read.format("csv").\
    option("sep", ",").\
    option("header", True).\
    option("inferSchema", True).\
    load("sa311/case.csv")

df_case.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- case_opened_date: string (nullable = true)
 |-- case_closed_date: string (nullable = true)
 |-- SLA_due_date: string (nullable = true)
 |-- case_late: string (nullable = true)
 |-- num_days_late: double (nullable = true)
 |-- case_closed: string (nullable = true)
 |-- dept_division: string (nullable = true)
 |-- service_request_type: string (nullable = true)
 |-- SLA_days: double (nullable = true)
 |-- case_status: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- request_address: string (nullable = true)
 |-- council_district: integer (nullable = true)



In [6]:
df_case

DataFrame[case_id: int, case_opened_date: string, case_closed_date: string, SLA_due_date: string, case_late: string, num_days_late: double, case_closed: string, dept_division: string, service_request_type: string, SLA_days: double, case_status: string, source_id: string, request_address: string, council_district: int]

In [7]:
df_dept = spark.read.format("csv").\
    option("sep", ",").\
    option("header", True).\
    option("inferSchema", True).\
    load("sa311/dept.csv")

df_dept.printSchema()

root
 |-- dept_division: string (nullable = true)
 |-- dept_name: string (nullable = true)
 |-- standardized_dept_name: string (nullable = true)
 |-- dept_subject_to_SLA: string (nullable = true)



In [8]:
df_dept # note: this is a spark dataframe

DataFrame[dept_division: string, dept_name: string, standardized_dept_name: string, dept_subject_to_SLA: string]

2. Write df_case and df_dept back to disk into their own directories (my_cases and my_depts).

In [9]:
df_case.write.format('csv').mode("overwrite").\
    option("header","true").save("sa311/my_cases")

In [10]:
df_dept.write.format('csv').mode("overwrite").\
    option("header","true").save("sa311/my_depts")

3. Write df_case and df_dept to parquet files (my_cases_parquet and my_depts_parquet) 

##### Parquet is a popular columnar storage format for Hadoop.
The cells below fail with 'Unsupported class file major version 56' if Spark is running on Java 11 or 12, because of a bug that was open at the time of writing: https://github.com/gettyimages/docker-spark/issues/56
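
The number in that error message is a Java class-file major version, which differs from the Java release number by a fixed offset of 44 (52 is Java 8, 55 is Java 11, 56 is Java 12). The sketch below makes that mapping explicit; `java_release` is a hypothetical helper written for this note, not part of Spark or the JDK.

```python
# Hypothetical helper (not part of Spark or the JDK): map a class-file
# major version to the Java release that produces it.
CLASS_FILE_OFFSET = 44  # e.g. major version 52 -> Java 8


def java_release(major_version: int) -> int:
    """Return the Java release for a class-file major version (Java 5+)."""
    if major_version < 49:
        raise ValueError("this sketch only covers Java 5 (version 49) and later")
    return major_version - CLASS_FILE_OFFSET


for major in (52, 55, 56):
    print(major, "-> Java", java_release(major))
```

Running `java -version` in the same shell that launches the notebook shows which release Spark is likely to pick up.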

In [11]:
# "header" is a CSV option; Parquet stores column names in its schema.
df_case.write.format('parquet').mode('overwrite').\
    save('sa311/my_cases_parquet')

In [12]:
# "header" is a CSV option; Parquet stores column names in its schema.
df_dept.write.format('parquet').mode('overwrite').\
    save('sa311/my_depts_parquet')

4. Read your parquet files back into your spark environment. 

In [13]:
# Parquet files carry their own schema, so the CSV-only
# "header" and "inferSchema" options are not needed here.
df_case = spark.read.format('parquet').\
    load("sa311/my_cases_parquet")
df_case # spark dataframe again

DataFrame[case_id: int, case_opened_date: string, case_closed_date: string, SLA_due_date: string, case_late: string, num_days_late: double, case_closed: string, dept_division: string, service_request_type: string, SLA_days: double, case_status: string, source_id: string, request_address: string, council_district: int]

In [14]:
# Parquet files carry their own schema, so the CSV-only
# "header" and "inferSchema" options are not needed here.
df_dept = spark.read.format('parquet').\
    load("sa311/my_depts_parquet")

5. Read case.csv and dept.csv into pandas DataFrames (cases_pdf, depts_pdf).

In [15]:
cases_pdf = pd.read_csv("sa311/case.csv", sep=",")
cases_pdf.head()

Unnamed: 0,case_id,case_opened_date,case_closed_date,SLA_due_date,case_late,num_days_late,case_closed,dept_division,service_request_type,SLA_days,case_status,source_id,request_address,council_district
0,1014127332,1/1/18 0:42,1/1/18 12:29,9/26/20 0:42,NO,-998.508762,YES,Field Operations,Stray Animal,999.0,Closed,svcCRMLS,"2315 EL PASO ST, San Antonio, 78207",5
1,1014127333,1/1/18 0:46,1/3/18 8:11,1/5/18 8:30,NO,-2.012604,YES,Storm Water,Removal Of Obstruction,4.322222,Closed,svcCRMSS,"2215 GOLIAD RD, San Antonio, 78223",3
2,1014127334,1/1/18 0:48,1/2/18 7:57,1/5/18 8:30,NO,-3.022338,YES,Storm Water,Removal Of Obstruction,4.320729,Closed,svcCRMSS,"102 PALFREY ST W, San Antonio, 78223",3
3,1014127335,1/1/18 1:29,1/2/18 8:13,1/17/18 8:30,NO,-15.011481,YES,Code Enforcement,Front Or Side Yard Parking,16.291887,Closed,svcCRMSS,"114 LA GARDE ST, San Antonio, 78223",3
4,1014127336,1/1/18 1:34,1/1/18 13:29,1/1/18 4:34,YES,0.372164,YES,Field Operations,Animal Cruelty(Critical),0.125,Closed,svcCRMSS,"734 CLEARVIEW DR, San Antonio, 78228",7


In [16]:
depts_pdf = pd.read_csv("sa311/dept.csv", sep=",")
depts_pdf.head()

Unnamed: 0,dept_division,dept_name,standardized_dept_name,dept_subject_to_SLA
0,311 Call Center,Customer Service,Customer Service,YES
1,Brush,Solid Waste Management,Solid Waste,YES
2,Clean and Green,Parks and Recreation,Parks & Recreation,YES
3,Clean and Green Natural Areas,Parks and Recreation,Parks & Recreation,YES
4,Code Enforcement,Code Enforcement Services,DSD/Code Enforcement,YES


6. Convert the pandas DataFrames into Spark DataFrames (cases_sdf, depts_sdf).

In [17]:
spark = pyspark.sql.SparkSession.builder.getOrCreate()

In [18]:
cases_pdf.columns

Index(['case_id', 'case_opened_date', 'case_closed_date', 'SLA_due_date',
       'case_late', 'num_days_late', 'case_closed', 'dept_division',
       'service_request_type', 'SLA_days', 'case_status', 'source_id',
       'request_address', 'council_district'],
      dtype='object')

In [19]:
schema = T.StructType([
        T.StructField('case_id', T.StringType()),
        T.StructField('case_opened_date', T.StringType()),
        T.StructField('case_closed_date', T.StringType()),
        T.StructField('SLA_due_date', T.StringType()),
        T.StructField('case_late', T.StringType()),
        T.StructField('num_days_late', T.StringType()),
        T.StructField('case_closed', T.StringType()),
        T.StructField('dept_division', T.StringType()),
        T.StructField('service_request_type', T.StringType()),
        T.StructField('SLA_days', T.StringType()),
        T.StructField('case_status', T.StringType()),
        T.StructField('source_id', T.StringType()),
        T.StructField('request_address', T.StringType()),
        T.StructField('council_district', T.StringType())
        ])

In [20]:
cases_sdf = spark.createDataFrame(cases_pdf, schema=schema)

In [21]:
cases_sdf.show(2)

+----------+----------------+----------------+------------+---------+-------------+-----------+----------------+--------------------+-----------+-----------+---------+--------------------+----------------+
|   case_id|case_opened_date|case_closed_date|SLA_due_date|case_late|num_days_late|case_closed|   dept_division|service_request_type|   SLA_days|case_status|source_id|     request_address|council_district|
+----------+----------------+----------------+------------+---------+-------------+-----------+----------------+--------------------+-----------+-----------+---------+--------------------+----------------+
|1014127332|     1/1/18 0:42|    1/1/18 12:29|9/26/20 0:42|       NO| -998.5087616|        YES|Field Operations|        Stray Animal|      999.0|     Closed| svcCRMLS|2315  EL PASO ST,...|               5|
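|1014127333|     1/1/18 0:46|     1/3/18 8:11| 1/5/18 8:30|       NO| -2.012604167|        YES|     Storm Water|Removal Of Obstru...|4.322222222|     Closed| svcCRMSS|2215  GOLIAD RD, ...|               3|
+----------+----------------+----------------+------------+---------+-------------+-----------+----------------+--------------------+-----------+-----------+---------+--------------------+----------------+
only showing top 2 rows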

In [22]:
depts_pdf.columns

Index(['dept_division', 'dept_name', 'standardized_dept_name',
       'dept_subject_to_SLA'],
      dtype='object')

In [23]:
schema = T.StructType([
        T.StructField('dept_division', T.StringType()),
        T.StructField('dept_name', T.StringType()),
        T.StructField('standardized_dept_name', T.StringType()),
        T.StructField('dept_subject_to_SLA', T.StringType())
        ])

In [24]:
depts_sdf = spark.createDataFrame(depts_pdf, schema=schema)

In [25]:
depts_sdf.show(2)

+---------------+--------------------+----------------------+-------------------+
|  dept_division|           dept_name|standardized_dept_name|dept_subject_to_SLA|
+---------------+--------------------+----------------------+-------------------+
|311 Call Center|    Customer Service|      Customer Service|                YES|
|          Brush|Solid Waste Manag...|           Solid Waste|                YES|
+---------------+--------------------+----------------------+-------------------+
only showing top 2 rows



7. Convert the Spark DataFrames back into pandas DataFrames (cases_pdf1, depts_pdf1).

In [26]:
cases_pdf1 = cases_sdf.toPandas()
cases_pdf1.head()

Unnamed: 0,case_id,case_opened_date,case_closed_date,SLA_due_date,case_late,num_days_late,case_closed,dept_division,service_request_type,SLA_days,case_status,source_id,request_address,council_district
0,1014127332,1/1/18 0:42,1/1/18 12:29,9/26/20 0:42,NO,-998.5087616,YES,Field Operations,Stray Animal,999.0,Closed,svcCRMLS,"2315 EL PASO ST, San Antonio, 78207",5
1,1014127333,1/1/18 0:46,1/3/18 8:11,1/5/18 8:30,NO,-2.012604167,YES,Storm Water,Removal Of Obstruction,4.322222222,Closed,svcCRMSS,"2215 GOLIAD RD, San Antonio, 78223",3
2,1014127334,1/1/18 0:48,1/2/18 7:57,1/5/18 8:30,NO,-3.022337963,YES,Storm Water,Removal Of Obstruction,4.320729167,Closed,svcCRMSS,"102 PALFREY ST W, San Antonio, 78223",3
3,1014127335,1/1/18 1:29,1/2/18 8:13,1/17/18 8:30,NO,-15.01148148,YES,Code Enforcement,Front Or Side Yard Parking,16.29188657,Closed,svcCRMSS,"114 LA GARDE ST, San Antonio, 78223",3
4,1014127336,1/1/18 1:34,1/1/18 13:29,1/1/18 4:34,YES,0.372164352,YES,Field Operations,Animal Cruelty(Critical),0.125,Closed,svcCRMSS,"734 CLEARVIEW DR, San Antonio, 78228",7


In [27]:
cases_pdf1.shape

(841704, 14)

In [28]:
depts_pdf1 = depts_sdf.toPandas()
depts_pdf1.head()

Unnamed: 0,dept_division,dept_name,standardized_dept_name,dept_subject_to_SLA
0,311 Call Center,Customer Service,Customer Service,YES
1,Brush,Solid Waste Management,Solid Waste,YES
2,Clean and Green,Parks and Recreation,Parks & Recreation,YES
3,Clean and Green Natural Areas,Parks and Recreation,Parks & Recreation,YES
4,Code Enforcement,Code Enforcement Services,DSD/Code Enforcement,YES


In [29]:
depts_pdf1.shape

(39, 4)

8. Read from the tables into two Spark DataFrames (cases_sdf, depts_sdf).

In [30]:
cases_sdf = spark.createDataFrame(cases_pdf1)

In [31]:
# cases_sdf.show(5)
# Left commented out: this call never finished running on my
# laptop, apparently because there is too much data to process.

In [32]:
depts_sdf = spark.createDataFrame(depts_pdf1)

In [33]:
# depts_sdf.show(5)
# Left commented out: this call also never finished running on my
# laptop. depts_pdf1 has only 39 rows, so data size alone seems an
# unlikely cause here.

In [34]:
spark.stop()

# Inspect
Continue working with the acquire file.

1. Read the 311 case data into a Spark DataFrame.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

from pyspark.sql import SparkSession, DataFrame, Column, Row, GroupedData, \
    DataFrameNaFunctions, DataFrameStatFunctions, functions, types, Window
from pyspark.sql import functions as f
import pyspark.sql.types as T
from pyspark.sql.functions import round
from pyspark.sql.functions import format_string
from pyspark.sql.functions import trim, upper
from pyspark.sql.functions import substring
from pyspark.sql.functions import regexp_extract


In [2]:
# import pyspark
# import pandas as pd
# from pyspark.sql import SparkSession
# import pyspark.sql.types as T

In [4]:
spark = SparkSession.builder.master("local").appName("read").\
    enableHiveSupport().\
    getOrCreate()

### NOTE: Now calling df_case just df.

In [6]:
data = "sa311/case.csv"
df = spark.read.csv(data, header=True, inferSchema=True)

# df = spark.read.format("csv").\
#     option("sep", ",").\
#     option("header", True).\
#     option("inferSchema", True).\
#     load(data)

2. Inspect the DataFrame. Are the data types for each column appropriate? Cast the data to appropriate types as needed.

In [7]:
type(df)

pyspark.sql.dataframe.DataFrame

In [8]:
shape = (df.count(), len(df.columns))
print(shape)

(841704, 14)


In [9]:
df.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- case_opened_date: string (nullable = true)
 |-- case_closed_date: string (nullable = true)
 |-- SLA_due_date: string (nullable = true)
 |-- case_late: string (nullable = true)
 |-- num_days_late: double (nullable = true)
 |-- case_closed: string (nullable = true)
 |-- dept_division: string (nullable = true)
 |-- service_request_type: string (nullable = true)
 |-- SLA_days: double (nullable = true)
 |-- case_status: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- request_address: string (nullable = true)
 |-- council_district: integer (nullable = true)



In [10]:
df.select('*').limit(3).toPandas().head()

Unnamed: 0,case_id,case_opened_date,case_closed_date,SLA_due_date,case_late,num_days_late,case_closed,dept_division,service_request_type,SLA_days,case_status,source_id,request_address,council_district
0,1014127332,1/1/18 0:42,1/1/18 12:29,9/26/20 0:42,NO,-998.508762,YES,Field Operations,Stray Animal,999.0,Closed,svcCRMLS,"2315 EL PASO ST, San Antonio, 78207",5
1,1014127333,1/1/18 0:46,1/3/18 8:11,1/5/18 8:30,NO,-2.012604,YES,Storm Water,Removal Of Obstruction,4.322222,Closed,svcCRMSS,"2215 GOLIAD RD, San Antonio, 78223",3
2,1014127334,1/1/18 0:48,1/2/18 7:57,1/5/18 8:30,NO,-3.022338,YES,Storm Water,Removal Of Obstruction,4.320729,Closed,svcCRMSS,"102 PALFREY ST W, San Antonio, 78223",3


Convert the days late to weeks late.

In [11]:
from pyspark.sql.functions import round

# Round to two decimal places, 
# use alias to rename column, 
# add a new column with a new column name using withColumn.
# df.select("num_days_late", round(df.num_days_late / 7.00, 2)\
#   .alias("num_weeks_late")) \
#   .show(5)

df.select("num_days_late") \
  .withColumn("num_weeks_late", 
              round(df.num_days_late / 7, 2)) \
  .limit(10) \
  .toPandas() \
  .head()

Unnamed: 0,num_days_late,num_weeks_late
0,-998.508762,-142.64
1,-2.012604,-0.29
2,-3.022338,-0.43
3,-15.011481,-2.14
4,0.372164,0.05
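
The conversion above divides by 7 and rounds to two decimals. Spark's `round` is documented as using HALF_UP rounding, whereas Python's built-in `round` rounds ties to even, so a plain-Python check of the same arithmetic is safest with `decimal`. The sketch below is only an illustration of the calculation; `days_to_weeks` is a hypothetical helper name, and the inputs are copied from the output above.

```python
from decimal import Decimal, ROUND_HALF_UP


def days_to_weeks(num_days_late: float) -> float:
    """Divide by 7 and round half up to two decimal places."""
    weeks = Decimal(str(num_days_late)) / Decimal(7)
    return float(weeks.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


# Values taken from the num_days_late column shown above.
for days in (-998.508762, -2.012604, 0.372164):
    print(days, "->", days_to_weeks(days))
```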


Convert the council_district from an integer to a string.

In [12]:
# Note that "%010d" makes the field have ten digits,
# padded with zeros as necessary.
from pyspark.sql.functions import format_string
df.select("council_district", format_string("%010d", "council_district").\
          alias("council_district_fixed")).show(5)

+----------------+----------------------+
|council_district|council_district_fixed|
+----------------+----------------------+
|               5|            0000000005|
|               3|            0000000003|
|               3|            0000000003|
|               3|            0000000003|
|               7|            0000000007|
+----------------+----------------------+
only showing top 5 rows
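
`format_string` uses printf-style patterns, and Python's `%` operator follows the same convention for `%010d`, so the padding can be checked without Spark. The snippet below is a small illustration, not the notebook's method. Whether the padded form or the bare digits is the better string representation depends on how the column is used downstream, which is an assumption left open here.

```python
council_district = 5

# printf-style padding: width 10, zero-filled, decimal integer.
padded = "%010d" % council_district

# Plain conversion keeps only the digits that are present.
plain = str(council_district)

print(padded)  # 0000000005
print(plain)   # 5
```

In Spark, the unpadded variant would correspond to casting the column, for example `df.council_district.cast("string")`.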



Convert the case_closed flag from a string to a Boolean.

In [13]:
df.select("case_closed", (df["case_closed"] == "YES")\
  .alias("case_closed_boolean")) \
  .show(5)

+-----------+-------------------+
|case_closed|case_closed_boolean|
+-----------+-------------------+
|        YES|               true|
|        YES|               true|
|        YES|               true|
|        YES|               true|
|        YES|               true|
+-----------+-------------------+
only showing top 5 rows



In [14]:
df.select("case_closed", "case_closed_date")\
  .filter(df["case_closed"] == "NO") \
  .show(5)

+-----------+----------------+
|case_closed|case_closed_date|
+-----------+----------------+
|         NO|            null|
|         NO|            null|
|         NO|            null|
|         NO|            null|
|         NO|            null|
+-----------+----------------+
only showing top 5 rows



Normalize the address column.

In [15]:
# Trim whitespace and convert request_address to uppercase
from pyspark.sql.functions import trim, upper
df.select("request_address", upper(trim(df.request_address)) \
  .alias("request_address_upper")) \
  .show(5)

+--------------------+---------------------+
|     request_address|request_address_upper|
+--------------------+---------------------+
|2315  EL PASO ST,...| 2315  EL PASO ST,...|
|2215  GOLIAD RD, ...| 2215  GOLIAD RD, ...|
|102  PALFREY ST W...| 102  PALFREY ST W...|
|114  LA GARDE ST,...| 114  LA GARDE ST,...|
|734  CLEARVIEW DR...| 734  CLEARVIEW DR...|
+--------------------+---------------------+
only showing top 5 rows
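
Note that `trim` only removes leading and trailing whitespace, so the double space after the house number is still present in the output above. If collapsing internal whitespace is also wanted, the idea can be sketched in plain Python as below; `normalize_address` is a hypothetical helper and the sample string is made up to resemble the first row. In Spark the same effect would presumably need `regexp_replace` in addition to `trim` and `upper`.

```python
import re


def normalize_address(address: str) -> str:
    """Strip the ends, collapse runs of whitespace, and upper-case."""
    return re.sub(r"\s+", " ", address.strip()).upper()


# Hypothetical input modeled on the first row of the dataset.
raw = " 2315  El Paso St, San Antonio, 78207 "
print(normalize_address(raw))
```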



Extract the zip code from the address.

In [16]:
from pyspark.sql.functions import substring
df.select("request_address", 
          substring("request_address", -5, 5). \
          alias("request_address_zip")) \
  .show(5)

+--------------------+-------------------+
|     request_address|request_address_zip|
+--------------------+-------------------+
|2315  EL PASO ST,...|              78207|
|2215  GOLIAD RD, ...|              78223|
|102  PALFREY ST W...|              78223|
|114  LA GARDE ST,...|              78223|
|734  CLEARVIEW DR...|              78228|
+--------------------+-------------------+
only showing top 5 rows



Alternatively, extract the zip code using a regular expression.

In [17]:
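from pyspark.sql.functions import regexp_extract
# Raw string so that \d reaches the regex engine unchanged;
# request_address is already a string column, so no cast is needed.
df.select("request_address", 
          regexp_extract(df.request_address, r"(\d{5})$", 1) \
  .alias("request_address_zip")) \
  .show(5)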

+--------------------+-------------------+
|     request_address|request_address_zip|
+--------------------+-------------------+
|2315  EL PASO ST,...|              78207|
|2215  GOLIAD RD, ...|              78223|
|102  PALFREY ST W...|              78223|
|114  LA GARDE ST,...|              78223|
|734  CLEARVIEW DR...|              78228|
+--------------------+-------------------+
only showing top 5 rows
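
The two approaches agree on the rows shown, but they differ when an address does not end in five digits: taking the last five characters returns whatever is there, while the regular expression returns an empty string (which is also what `regexp_extract` returns when nothing matches). The plain-Python sketch below illustrates the difference; the helper names and the address without a zip code are hypothetical.

```python
import re

ZIP_AT_END = re.compile(r"(\d{5})$")


def zip_by_position(address: str) -> str:
    """Take the last five characters, whatever they are."""
    return address[-5:]


def zip_by_regex(address: str) -> str:
    """Return five trailing digits, or '' when the address has none."""
    match = ZIP_AT_END.search(address)
    return match.group(1) if match else ""


with_zip = "2315  EL PASO ST, San Antonio, 78207"
without_zip = "2315  EL PASO ST, San Antonio"  # hypothetical row with no zip

for addr in (with_zip, without_zip):
    print(repr(zip_by_position(addr)), repr(zip_by_regex(addr)))
```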



Fix the date fields to be of data type timestamp.

In [18]:
df.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- case_opened_date: string (nullable = true)
 |-- case_closed_date: string (nullable = true)
 |-- SLA_due_date: string (nullable = true)
 |-- case_late: string (nullable = true)
 |-- num_days_late: double (nullable = true)
 |-- case_closed: string (nullable = true)
 |-- dept_division: string (nullable = true)
 |-- service_request_type: string (nullable = true)
 |-- SLA_days: double (nullable = true)
 |-- case_status: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- request_address: string (nullable = true)
 |-- council_district: integer (nullable = true)



In [19]:
df = df.withColumn("case_opened_date", 
                   f.to_timestamp(f.col("case_opened_date"), 
                                  "M/d/yy H:mm")).\
        withColumn("case_closed_date", 
                   f.to_timestamp(f.col("case_closed_date"),
                                  "M/d/yy H:mm")).\
        withColumn("SLA_due_date", 
                   f.to_timestamp(f.col("SLA_due_date"),
                                  "M/d/yy H:mm"))
# withColumn adds a new column, or replaces an existing one when the
# given name is already in use, as it is for all three date columns here.

In [20]:
df.printSchema()

root
 |-- case_id: integer (nullable = true)
 |-- case_opened_date: timestamp (nullable = true)
 |-- case_closed_date: timestamp (nullable = true)
 |-- SLA_due_date: timestamp (nullable = true)
 |-- case_late: string (nullable = true)
 |-- num_days_late: double (nullable = true)
 |-- case_closed: string (nullable = true)
 |-- dept_division: string (nullable = true)
 |-- service_request_type: string (nullable = true)
 |-- SLA_days: double (nullable = true)
 |-- case_status: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- request_address: string (nullable = true)
 |-- council_district: integer (nullable = true)
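
The pattern `"M/d/yy H:mm"` passed to `to_timestamp` reads as month, day, two-digit year, 24-hour hour, and minute, with no zero padding required for month, day, or hour. As a rough cross-check outside Spark, the closest `strptime` format is shown below; `PY_FORMAT` is a name introduced for this sketch, and the sample strings are taken from the data shown earlier.

```python
from datetime import datetime

# Python counterpart of the Spark pattern "M/d/yy H:mm".
PY_FORMAT = "%m/%d/%y %H:%M"

for raw in ("1/1/18 0:42", "9/26/20 0:42", "1/17/18 8:30"):
    print(raw, "->", datetime.strptime(raw, PY_FORMAT))
```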



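Compute the age of each case: days since it was opened (case_age), days from open to close (days_to_closed), and case_lifetime, which uses case_age for open cases and days_to_closed for closed ones.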

In [21]:
df = df.withColumn("case_age", 
                   f.datediff(f.current_timestamp(), 
                              "case_opened_date")). \
    withColumn("days_to_closed", 
               f.datediff("case_closed_date", 
                          "case_opened_date")).\
    withColumn("case_lifetime", 
               f.when(df["case_closed"]=="NO", 
                      f.col("case_age")) \
                .otherwise(f.col("days_to_closed")))

In [22]:
df.select("case_opened_date",
          "case_closed_date", 
          "days_to_closed",
          "case_age",
          "case_lifetime") \
  .limit(10) \
  .toPandas() \
  .head()

Unnamed: 0,case_opened_date,case_closed_date,days_to_closed,case_age,case_lifetime
0,2018-01-01 00:42:00,2018-01-01 12:29:00,0,501,0
1,2018-01-01 00:46:00,2018-01-03 08:11:00,2,501,2
2,2018-01-01 00:48:00,2018-01-02 07:57:00,1,501,1
3,2018-01-01 01:29:00,2018-01-02 08:13:00,1,501,1
4,2018-01-01 01:34:00,2018-01-01 13:29:00,0,501,0
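
`datediff` compares calendar dates rather than elapsed 24-hour periods, which is why a case opened at 00:48 and closed at 07:57 the next day shows 1, while one opened and closed on the same day shows 0. The plain-Python sketch below mirrors that logic under two assumptions: the helper names are hypothetical, and a missing close date stands in for `case_closed == "NO"`. The "current" date is also made up, chosen so that the open-case result matches the 501 shown above.

```python
from datetime import datetime
from typing import Optional


def date_diff(end: datetime, start: datetime) -> int:
    """Whole calendar days between two timestamps, ignoring time of day."""
    return (end.date() - start.date()).days


def case_lifetime(opened: datetime, closed: Optional[datetime],
                  now: datetime) -> int:
    """Days open so far for an open case, else days from open to close."""
    if closed is None:
        return date_diff(now, opened)
    return date_diff(closed, opened)


opened = datetime(2018, 1, 1, 0, 48)
closed = datetime(2018, 1, 2, 7, 57)
as_of = datetime(2019, 5, 17)  # hypothetical "current" date for the example

print(case_lifetime(opened, closed, as_of))  # closed case
print(case_lifetime(opened, None, as_of))    # open case
```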


Explore the data.

In [23]:
caseClosedFilter = df.case_closed == "YES"
caseLateFilter = df["case_late"] == "YES"

df.select("case_closed", "case_late", 
          caseClosedFilter & caseLateFilter) \
  .show(5)

+-----------+---------+-------------------------------------------+
|case_closed|case_late|((case_closed = YES) AND (case_late = YES))|
+-----------+---------+-------------------------------------------+
|        YES|       NO|                                      false|
|        YES|       NO|                                      false|
|        YES|       NO|                                      false|
|        YES|       NO|                                      false|
|        YES|      YES|                                       true|
+-----------+---------+-------------------------------------------+
only showing top 5 rows



In [24]:
df.select("case_closed", "case_late", 
          caseClosedFilter | caseLateFilter) \
  .show(5)

+-----------+---------+------------------------------------------+
|case_closed|case_late|((case_closed = YES) OR (case_late = YES))|
+-----------+---------+------------------------------------------+
|        YES|       NO|                                      true|
|        YES|       NO|                                      true|
|        YES|       NO|                                      true|
|        YES|       NO|                                      true|
|        YES|      YES|                                      true|
+-----------+---------+------------------------------------------+
only showing top 5 rows



Note how nulls are treated when boolean expressions are combined with & (AND) and | (OR):
    - true & null = null
    - false & null = false
    - true | null = true
    - false | null = null
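
These four rules are SQL's three-valued logic: a definite answer wins when one operand already decides the result, and otherwise the null propagates. A minimal plain-Python sketch, with `None` standing in for null and `sql_and` / `sql_or` as hypothetical helper names:

```python
from typing import Optional

Tri = Optional[bool]  # True, False, or None standing in for SQL NULL


def sql_and(a: Tri, b: Tri) -> Tri:
    """Three-valued AND: False decides the result, otherwise NULL propagates."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def sql_or(a: Tri, b: Tri) -> Tri:
    """Three-valued OR: True decides the result, otherwise NULL propagates."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


for a in (True, False):
    print(f"{a} AND NULL -> {sql_and(a, None)}; {a} OR NULL -> {sql_or(a, None)}")
```

In a filter, rows whose condition evaluates to null are dropped along with those that evaluate to false.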

Using multiple boolean expressions in a filter

In [25]:
df.filter(caseClosedFilter & caseLateFilter) \
  .select("case_closed", "case_late") \
  .show(5)

+-----------+---------+
|case_closed|case_late|
+-----------+---------+
|        YES|      YES|
|        YES|      YES|
|        YES|      YES|
|        YES|      YES|
|        YES|      YES|
+-----------+---------+
only showing top 5 rows



is equivalent to...

In [26]:
df.filter(caseLateFilter)\
  .filter(caseClosedFilter) \
  .select("case_closed", "case_late") \
  .show(5)

+-----------+---------+
|case_closed|case_late|
+-----------+---------+
|        YES|      YES|
|        YES|      YES|
|        YES|      YES|
|        YES|      YES|
|        YES|      YES|
+-----------+---------+
only showing top 5 rows
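
The equivalence holds because a row survives two chained filters only if it passes both conditions, which is exactly what the combined AND condition requires. A small plain-Python sketch of the same idea, using made-up rows and hypothetical predicate names:

```python
# Made-up rows for illustration; None stands in for a null value.
rows = [
    {"case_closed": "YES", "case_late": "NO"},
    {"case_closed": "YES", "case_late": "YES"},
    {"case_closed": "NO", "case_late": "YES"},
    {"case_closed": "NO", "case_late": None},
]


def is_closed(row):
    return row["case_closed"] == "YES"


def is_late(row):
    return row["case_late"] == "YES"


# One filter with both conditions combined.
combined = [r for r in rows if is_closed(r) and is_late(r)]

# Two filters applied one after the other.
late_rows = [r for r in rows if is_late(r)]
chained = [r for r in late_rows if is_closed(r)]

print(combined == chained)
print(combined)
```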



In [27]:
spark.stop()