# Chapter 4: Analyzing tabular data with PySpark

This chapter covers:

- Reading delimited data into a PySpark data frame
- Understanding how PySpark represents tabular data in a data frame
- Ingesting and exploring tabular or relational data
- Selecting, manipulating, renaming, and deleting columns in a data frame
- Summarizing data frames for quick exploration

In [1]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# change the account name to your email account
account='sli'

# define a root path to access the DataAnalysisWithPythonAndPySpark data
data_path='/net/clusterhn/home/'+account+'/isa460/data/'

spark = (SparkSession.builder.appName("Analyzing tabular data")
        .config("spark.port.maxRetries", "100")
        .getOrCreate())

# configure the log level (the default is WARN)
spark.sparkContext.setLogLevel('ERROR')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/23 14:19:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/09/23 14:19:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
24/09/23 14:19:43 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.


In [2]:
# import data from a list of lists

my_grocery_list=[
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["cake", 1, 10.99]
]

# create a data frame from the list

df=spark.createDataFrame(my_grocery_list, ["Item", "Quantity", "Price"])

df.printSchema()

root
 |-- Item: string (nullable = true)
 |-- Quantity: long (nullable = true)
 |-- Price: double (nullable = true)
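The schema above was inferred from the Python values in the list. As a rough illustration only (plain Python, not Spark's actual inference code), the sketch below reproduces the three type names shown in the output. The `SPARK_TYPE_NAMES` dictionary and the `infer_type_names` helper are hypothetical names introduced for this example.

```python
# Toy illustration of how Python value types line up with the type names
# printed by printSchema(). This is NOT Spark's inference code.
SPARK_TYPE_NAMES = {str: "string", int: "long", float: "double"}


def infer_type_names(rows, column_names):
    """Pair each column name with the type name of the first row's value."""
    first_row = rows[0]
    return [
        (name, SPARK_TYPE_NAMES[type(value)])
        for name, value in zip(column_names, first_row)
    ]


my_grocery_list = [
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["cake", 1, 10.99],
]

for name, type_name in infer_type_names(my_grocery_list, ["Item", "Quantity", "Price"]):
    print(f"{name}: {type_name}")
```

The sketch looks only at the first row and knows only the three types used here, which is enough to mirror the schema printed above.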



## Import data from a CSV file

For this exercise, we’ll use some open data from the government of Canada, more specifically the CRTC (Canadian Radio-Television and Telecommunications Commission). Every broadcaster is mandated to provide a complete log of the programs and commercials showcased to the Canadian public. This gives us a lot of potential questions to answer, but we’ll select just one:
**What are the channels with the greatest and least proportion of commercials?**

You can download the [file](http://mng.bz/y4YJ) from the Canada Open Data portal; select the BroadcastLogs_2018_Q3_M8 file. The file is a 994 MB download, which might be too large, depending on your computer. The book’s repository contains a sample of the data under the data/broadcast_logs directory, which you can use in place of the original file. You also need to download the Data Dictionary in .doc form, as well as the Reference Tables zip file, unzipping them into a ReferenceTables directory in data/broadcast_logs. Once again, the examples assume that the data is downloaded under data/broadcast_logs and that PySpark is launched from the root of the repository.

In [3]:
import os

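# data_path already ends with a slash, so join the pieces instead of concatenating them
directory=os.path.join(data_path, 'broadcast_logs')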

logs=spark.read.csv(os.path.join(directory, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
                                 sep="|",
                                 header=True,
                                 inferSchema=True,
                                 timestampFormat="yyyy-MM-dd",)
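
To get a feel for what the `sep` and `header` arguments ask for, here is a small plain-Python sketch using the standard `csv` module on an in-memory sample. The sample holds three columns and two records taken from the output shown later in this chapter. It is only an analogy and not how Spark reads the file.

```python
import csv
import io

# Two records in the same pipe-delimited layout as the broadcast logs file.
sample = (
    "BroadcastLogID|LogServiceID|LogDate\n"
    "1196192316|3157|2018-08-01\n"
    "1196192317|3157|2018-08-01\n"
)

# delimiter="|" plays the role of sep="|"; DictReader treats the first line
# as column names, which is what header=True asks Spark to do.
reader = csv.DictReader(io.StringIO(sample), delimiter="|")
rows = list(reader)

print(reader.fieldnames)
print(rows[0])
```

Every value read this way is a string. In the Spark call, `inferSchema=True` is what asks Spark to guess integer and date types instead.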

                                                                                

In [8]:
logs.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: date (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: date (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable 

## Exploring the shape of our data universe

![Figure 4.4](https://raw.githubusercontent.com/Suhong88/ISA460_Fall2023/main/images/Figure%204.4.png)

## The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing

### Select columns

In [16]:
logs.select("BroadcastLogID", "LogServiceID", "LogDate").show(5, False)

+--------------+------------+----------+
|BroadcastLogID|LogServiceID|LogDate   |
+--------------+------------+----------+
|1196192316    |3157        |2018-08-01|
|1196192317    |3157        |2018-08-01|
|1196192318    |3157        |2018-08-01|
|1196192319    |3157        |2018-08-01|
|1196192320    |3157        |2018-08-01|
+--------------+------------+----------+
only showing top 5 rows



In [18]:
# four ways of selecting columns
# Using the string to column conversion
logs.select("BroadCastLogID", "LogServiceID", "LogDate")

# use * to unpack a list
logs.select(*["BroadCastLogID", "LogServiceID", "LogDate"])
 
# Passing the column object explicitly
logs.select(
    F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")
)
logs.select(
    *[F.col("BroadCastLogID"), F.col("LogServiceID"), F.col("LogDate")]
)

DataFrame[BroadCastLogID: int, LogServiceID: int, LogDate: date]

In [4]:
# for a data frame with many columns, split the columns into small groups and display one group at a time

import numpy as np

total_columns=len(logs.columns)

columns_per_row=5

# number of groups, rounded up so that no group exceeds columns_per_row
num_groups=-(-total_columns//columns_per_row)

column_split=np.array_split(logs.columns, num_groups)

for columns in column_split:
    logs.select(*columns).show(5)

+--------------+------------+----------+----------+-------------------+
|BroadcastLogID|LogServiceID|   LogDate|SequenceNO|AudienceTargetAgeID|
+--------------+------------+----------+----------+-------------------+
|    1196192316|        3157|2018-08-01|         1|                  4|
|    1196192317|        3157|2018-08-01|         2|               null|
|    1196192318|        3157|2018-08-01|         3|               null|
|    1196192319|        3157|2018-08-01|         4|               null|
|    1196192320|        3157|2018-08-01|         5|               null|
+--------------+------------+----------+----------+-------------------+
only showing top 5 rows

+----------------------+----------+---------------+-----------------+----------------+
|AudienceTargetEthnicID|CategoryID|ClosedCaptionID|CountryOfOriginID|DubDramaCreditID|
+----------------------+----------+---------------+-----------------+----------------+
|                  null|        13|              3|               

In [5]:
# create a function to display sample records, one group of columns at a time

import numpy as np

def displayColumnsByGroup(df, columns_per_row, sample_size=5):

    total_columns=len(df.columns)
    # number of groups, rounded up so that no group exceeds columns_per_row
    num_groups=-(-total_columns//columns_per_row)

    column_split=np.array_split(df.columns, num_groups)

    for columns in column_split:
        df.select(*columns).show(sample_size)
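
A plain-Python way to form the column groups, without NumPy, is sketched below. The `chunk_columns` helper is a hypothetical name introduced here. It slices the list of names so that every group holds at most `columns_per_row` entries and the last group takes the remainder.

```python
def chunk_columns(columns, columns_per_row):
    """Split a list of column names into consecutive groups of at most
    columns_per_row names; the last group holds any remainder."""
    if columns_per_row < 1:
        raise ValueError("columns_per_row must be at least 1")
    return [
        columns[start:start + columns_per_row]
        for start in range(0, len(columns), columns_per_row)
    ]


# The first seven column names of the logs data frame.
sample_columns = [
    "BroadcastLogID", "LogServiceID", "LogDate", "SequenceNO",
    "AudienceTargetAgeID", "AudienceTargetEthnicID", "CategoryID",
]

for group in chunk_columns(sample_columns, 5):
    print(group)
```

Each group could then be passed to `select(*group)` in the same way as the NumPy-based groups.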

In [6]:
# display the logs data frame five columns at a time

displayColumnsByGroup(logs, 5)
# displayColumnsByGroup(logs, 5, 10)

+--------------+------------+----------+----------+-------------------+
|BroadcastLogID|LogServiceID|   LogDate|SequenceNO|AudienceTargetAgeID|
+--------------+------------+----------+----------+-------------------+
|    1196192316|        3157|2018-08-01|         1|                  4|
|    1196192317|        3157|2018-08-01|         2|               null|
|    1196192318|        3157|2018-08-01|         3|               null|
|    1196192319|        3157|2018-08-01|         4|               null|
|    1196192320|        3157|2018-08-01|         5|               null|
+--------------+------------+----------+----------+-------------------+
only showing top 5 rows

+----------------------+----------+---------------+-----------------+----------------+
|AudienceTargetEthnicID|CategoryID|ClosedCaptionID|CountryOfOriginID|DubDramaCreditID|
+----------------------+----------+---------------+-----------------+----------------+
|                  null|        13|              3|               

### Store this function in a Python file to be used later

### Drop columns

In [7]:
logs1 = logs.drop("BroadcastLogID", "SequenceNO")

# Testing if we effectively got rid of the columns
 
print("BroadcastLogID" in logs1.columns)  # => False
print("SequenceNO" in logs1.columns)  # => False

False
False


In [9]:
# instead of dropping columns, you can select the ones you want to keep
logs1 = logs.select(
    *[c for c in logs.columns if c not in ["BroadcastLogID", "SequenceNO"]]
)

logs1.columns

['LogServiceID',
 'LogDate',
 'AudienceTargetAgeID',
 'AudienceTargetEthnicID',
 'CategoryID',
 'ClosedCaptionID',
 'CountryOfOriginID',
 'DubDramaCreditID',
 'EthnicProgramID',
 'ProductionSourceID',
 'ProgramClassID',
 'FilmClassificationID',
 'ExhibitionID',
 'Duration',
 'EndTime',
 'LogEntryDate',
 'ProductionNO',
 'ProgramTitle',
 'StartTime',
 'Subtitle',
 'NetworkAffiliationID',
 'SpecialAttentionID',
 'BroadcastOriginPointID',
 'CompositionID',
 'Producer1',
 'Producer2',
 'Language1',
 'Language2']

### Create new columns

In [6]:
# create a column showing duration in seconds

logs.select("Duration").show(5)

+----------------+
|        Duration|
+----------------+
|02:00:00.0000000|
|00:00:30.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
|00:00:15.0000000|
+----------------+
only showing top 5 rows



In [10]:
logs.select(F.col("Duration")).dtypes

[('Duration', 'string')]

In [11]:
# step 1: extract hours, minutes and seconds

logs.select(F.col("Duration").substr(1,2).cast("int").alias("dur_hours"), 
            F.col("Duration").substr(4,2).cast("int").alias("dur_minutes"),
            F.col("Duration").substr(7,2).cast("int").alias("dur_seconds")).show()

+---------+-----------+-----------+
|dur_hours|dur_minutes|dur_seconds|
+---------+-----------+-----------+
|        2|          0|          0|
|        0|          0|         30|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         30|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         30|
|        0|          0|         30|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          0|         15|
|        0|          1|          0|
|        0|          0|         15|
+---------+-----------+-----------+
only showing top 20 rows



In [12]:
# step 2. merge all fields into one

# the inner aliases are not needed: only the alias on the final expression names the column
logs.select(F.col("Duration"), (F.col("Duration").substr(1,2).cast("int")*60*60+
            F.col("Duration").substr(4,2).cast("int")*60+
            F.col("Duration").substr(7,2).cast("int")).alias("Duration_seconds")).show(5)

+----------------+----------------+
|        Duration|Duration_seconds|
+----------------+----------------+
|02:00:00.0000000|            7200|
|00:00:30.0000000|              30|
|00:00:15.0000000|              15|
|00:00:15.0000000|              15|
|00:00:15.0000000|              15|
+----------------+----------------+
only showing top 5 rows
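
The same arithmetic can be checked on single values in plain Python. The sketch below uses a hypothetical helper, `duration_to_seconds`, that mirrors the substring approach. Note that Spark's `substr` counts positions from 1, while Python slices count from 0.

```python
def duration_to_seconds(duration):
    """Convert an 'HH:MM:SS.fffffff' string to whole seconds,
    ignoring the fractional part, as the substr-based cells do."""
    hours = int(duration[0:2])    # Spark: substr(1, 2)
    minutes = int(duration[3:5])  # Spark: substr(4, 2)
    seconds = int(duration[6:8])  # Spark: substr(7, 2)
    return hours * 60 * 60 + minutes * 60 + seconds


for value in ["02:00:00.0000000", "00:00:30.0000000", "00:00:15.0000000"]:
    print(value, duration_to_seconds(value))
```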



In [13]:
# create a new column for duration in seconds

logs.withColumn("Duration_seconds", F.col("Duration").substr(1,2).cast("int")*60*60+
            F.col("Duration").substr(4,2).cast("int")*60+
            F.col("Duration").substr(7,2).cast("int")).select("Duration", "Duration_seconds").show(5)

+----------------+----------------+
|        Duration|Duration_seconds|
+----------------+----------------+
|02:00:00.0000000|            7200|
|00:00:30.0000000|              30|
|00:00:15.0000000|              15|
|00:00:15.0000000|              15|
|00:00:15.0000000|              15|
+----------------+----------------+
only showing top 5 rows



In [42]:
# print the schema. Why don't we see the new column?
logs.printSchema()

root
 |-- BroadcastLogID: integer (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- LogDate: date (nullable = true)
 |-- SequenceNO: integer (nullable = true)
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ProductionSourceID: integer (nullable = true)
 |-- ProgramClassID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- LogEntryDate: date (nullable = true)
 |-- ProductionNO: string (nullable = true)
 |-- ProgramTitle: string (nullable = true)
 |-- StartTime: string (nullable = true)
 |-- Subtitle: string (nullable 

In [43]:
logs1=logs.withColumn("Duration_seconds", F.col("Duration").substr(1,2).cast("int")*60*60+
            F.col("Duration").substr(4,2).cast("int")*60+
            F.col("Duration").substr(7,2).cast("int"))
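
The new column is missing from `logs` because `withColumn` returns a new data frame and leaves the original unchanged, so the result has to be assigned to a name. Python strings behave the same way, which makes for a quick analogy. The sketch below is only that analogy and does not involve Spark.

```python
title = "Duration"

# The result is discarded, like calling logs.withColumn(...) on its own.
title.lower()
print(title)  # the original string is unchanged

# Keep the result under a new name, like logs1 = logs.withColumn(...).
lowered = title.lower()
print(lowered)
```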

![Warning](https://raw.githubusercontent.com/Suhong88/ISA460_Fall2023/main/images/Figure%204.5.png)

### Rename and reorder columns

In [14]:
logs2=logs1.withColumnRenamed("Duration_seconds", "duration_seconds")
logs2.columns

['LogServiceID',
 'LogDate',
 'AudienceTargetAgeID',
 'AudienceTargetEthnicID',
 'CategoryID',
 'ClosedCaptionID',
 'CountryOfOriginID',
 'DubDramaCreditID',
 'EthnicProgramID',
 'ProductionSourceID',
 'ProgramClassID',
 'FilmClassificationID',
 'ExhibitionID',
 'Duration',
 'EndTime',
 'LogEntryDate',
 'ProductionNO',
 'ProgramTitle',
 'StartTime',
 'Subtitle',
 'NetworkAffiliationID',
 'SpecialAttentionID',
 'BroadcastOriginPointID',
 'CompositionID',
 'Producer1',
 'Producer2',
 'Language1',
 'Language2']

In [15]:
# change all column names to lower case

logs.toDF(*[x.lower() for x in logs.columns])

DataFrame[broadcastlogid: int, logserviceid: int, logdate: date, sequenceno: int, audiencetargetageid: int, audiencetargetethnicid: int, categoryid: int, closedcaptionid: int, countryoforiginid: int, dubdramacreditid: int, ethnicprogramid: int, productionsourceid: int, programclassid: int, filmclassificationid: int, exhibitionid: int, duration: string, endtime: string, logentrydate: date, productionno: string, programtitle: string, starttime: string, subtitle: string, networkaffiliationid: int, specialattentionid: int, broadcastoriginpointid: int, compositionid: int, producer1: string, producer2: string, language1: int, language2: int]

In [16]:
# create a function to change all column names to lower case

def columnLowerCase(df):
    return df.toDF(*[x.lower() for x in df.columns])

# add this function to our helper functions

In [21]:
#logs.columns

In [22]:
#columnLowerCase(logs)

In [23]:
# order all columns in alphabetical order

logs.select(sorted(logs.columns)).printSchema()

# store the result into a new dataframe

#logs1=logs.select(sorted(logs.columns))

root
 |-- AudienceTargetAgeID: integer (nullable = true)
 |-- AudienceTargetEthnicID: integer (nullable = true)
 |-- BroadcastLogID: integer (nullable = true)
 |-- BroadcastOriginPointID: integer (nullable = true)
 |-- CategoryID: integer (nullable = true)
 |-- ClosedCaptionID: integer (nullable = true)
 |-- CompositionID: integer (nullable = true)
 |-- CountryOfOriginID: integer (nullable = true)
 |-- DubDramaCreditID: integer (nullable = true)
 |-- Duration: string (nullable = true)
 |-- EndTime: string (nullable = true)
 |-- EthnicProgramID: integer (nullable = true)
 |-- ExhibitionID: integer (nullable = true)
 |-- FilmClassificationID: integer (nullable = true)
 |-- Language1: integer (nullable = true)
 |-- Language2: integer (nullable = true)
 |-- LogDate: date (nullable = true)
 |-- LogEntryDate: date (nullable = true)
 |-- LogServiceID: integer (nullable = true)
 |-- NetworkAffiliationID: integer (nullable = true)
 |-- Producer1: string (nullable = true)
 |-- Producer2: string 

### Diagnosing a data frame with describe() and summary()

In [25]:
#logs.describe().show()

In [None]:
# for a data frame with many columns, we can describe the columns one at a time

for i in logs.columns:
    logs.select(i).describe().show()

In [None]:
# return the names of the integer columns
# (dtype is used as the loop variable to avoid shadowing the built-in type)

numColumns=[name for name, dtype in logs.dtypes if dtype=='int']

numColumns
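
The cell above keeps only columns whose type is `int`. If a data frame also holds other numeric types, the filter can be widened. The sketch below works on a hand-written list of `(name, dtype)` pairs in the shape that `dtypes` returns. The first four pairs come from the logs schema; the `Quantity` and `Price` pairs are added for illustration, assuming the `bigint` and `double` type names Spark reports for the grocery list columns. `NUMERIC_TYPES` and `numeric_columns` are hypothetical names.

```python
# Type names that dtypes reports for whole and fractional numbers.
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of columns whose dtype string is numeric.
    Decimal types carry a precision, e.g. 'decimal(10,2)', so match the prefix."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


sample_dtypes = [
    ("BroadcastLogID", "int"),
    ("LogDate", "date"),
    ("Duration", "string"),
    ("Language1", "int"),
    ("Quantity", "bigint"),   # illustrative, not part of logs
    ("Price", "double"),      # illustrative, not part of logs
]

print(numeric_columns(sample_dtypes))
```

Matching exact names rather than the prefix `int` avoids picking up unrelated types whose names merely start with those letters.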

In [None]:
# apply describe to numerical columns
for i in numColumns:
    logs.select(i).describe().show()

## In-class exercise

### 1. Create a new data frame, logs_clean, that contains only the columns that do not end with ID.

In [26]:
import os

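# data_path already ends with a slash, so join the pieces instead of concatenating them
directory=os.path.join(data_path, 'broadcast_logs')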

logs=spark.read.csv(os.path.join(directory, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
                                 sep="|",
                                 header=True,
                                 inferSchema=True,
                                 timestampFormat="yyyy-MM-dd",)
logs.columns

['BroadcastLogID',
 'LogServiceID',
 'LogDate',
 'SequenceNO',
 'AudienceTargetAgeID',
 'AudienceTargetEthnicID',
 'CategoryID',
 'ClosedCaptionID',
 'CountryOfOriginID',
 'DubDramaCreditID',
 'EthnicProgramID',
 'ProductionSourceID',
 'ProgramClassID',
 'FilmClassificationID',
 'ExhibitionID',
 'Duration',
 'EndTime',
 'LogEntryDate',
 'ProductionNO',
 'ProgramTitle',
 'StartTime',
 'Subtitle',
 'NetworkAffiliationID',
 'SpecialAttentionID',
 'BroadcastOriginPointID',
 'CompositionID',
 'Producer1',
 'Producer2',
 'Language1',
 'Language2']

In [None]:
selected_columns=[column for column in logs.columns if not column.endswith('ID')]

logs_clean=logs.select(selected_columns)

logs_clean.columns

### 2. Display a list of program titles that include the word apple, removing duplicates.

In [None]:
logs.filter(F.lower(F.col('ProgramTitle')).contains('apple')).select('ProgramTitle').distinct().show(5, False)

### 3. Display the top 5 program titles by the number of times each has been broadcast.

In [None]:
logs.groupBy('ProgramTitle').count().orderBy(F.desc('count')).show(5, False)