# Chapter 4: Analyzing tabular data with PySpark

This chapter covers:

- Reading delimited data into a PySpark data frame
- Understanding how PySpark represents tabular data in a data frame
- Ingesting and exploring tabular or relational data
- Selecting, manipulating, renaming, and deleting columns in a data frame
- Summarizing data frames for quick exploration

In [None]:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# change the account name to your email account
account='sli'

# define the root path to the data from the DataAnalysisWithPythonAndPySpark repository
data_path='/net/clusterhn/home/'+account+'/isa460/data/'

# create a spark session
spark = (SparkSession.builder.appName("Analyzing tabular data")
        .config("spark.port.maxRetries", "100")
        .getOrCreate())

# configure the log level (the default is WARN)
spark.sparkContext.setLogLevel('ERROR')

In [None]:
# import data from a list of lists

my_grocery_list=[
    ["Banana", 2, 1.74],
    ["Apple", 4, 2.04],
    ["Carrot", 1, 1.09],
    ["cake", 1, 10.99]
]

# create a data frame from the list, naming the three columns

df=spark.createDataFrame(my_grocery_list, ["Item", "Quantity", "Price"])

df.printSchema()

## Importing data from a CSV file

For this exercise, we’ll use some open data from the government of Canada, more specifically the CRTC (Canadian Radio-Television and Telecommunications Commission). Every broadcaster is mandated to provide a complete log of the programs and commercials showcased to the Canadian public. This gives us a lot of potential questions to answer, but we’ll select just one:
**What are the channels with the greatest and least proportion of commercials?**

You can download the [file](http://mng.bz/y4YJ) from the Canada Open Data portal; select the BroadcastLogs_2018_Q3_M8 file. The download is 994 MB, which might be too large, depending on your computer. The book’s repository contains a sample of the data under the data/broadcast_logs directory, which you can use in place of the original file. You also need to download the Data Dictionary in .doc form, as well as the Reference Tables zip file, and unzip the latter into a ReferenceTables directory in data/broadcast_logs. Once again, the examples assume that the data is downloaded under data/broadcast_logs and that PySpark is launched from the root of the repository.

In [None]:
import os

directory = os.path.join(data_path, "broadcast_logs")

# the sample file is pipe-delimited, with the column names in the first row
logs = spark.read.csv(
    os.path.join(directory, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
    sep="|",
    header=True,
    inferSchema=True,
    timestampFormat="yyyy-MM-dd",
)

In [None]:
logs.printSchema()

## Exploring the shape of our data universe

![Figure 4.4](https://raw.githubusercontent.com/Suhong88/ISA460_Fall2023/main/images/Figure%204.4.png)

## The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing

### Select columns

In [None]:
# four ways of selecting columns
# 1. pass column names as strings; PySpark converts them to columns
logs.select("BroadcastLogID", "LogServiceID", "LogDate")

# 2. use * to unpack a list of column names
logs.select(*["BroadcastLogID", "LogServiceID", "LogDate"])

# 3. pass column objects explicitly
logs.select(
    F.col("BroadcastLogID"), F.col("LogServiceID"), F.col("LogDate")
)

# 4. use * to unpack a list of column objects
logs.select(
    *[F.col("BroadcastLogID"), F.col("LogServiceID"), F.col("LogDate")]
)

In [None]:
# for a data frame with many columns, slice the column list into small groups and display one group at a time
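
One way to do this is to chunk `logs.columns`, which is a plain Python list, and call `select()` on each chunk. The sketch below is only an illustration: the helper names `column_groups` and `show_in_groups` and the default group size of 3 are assumptions made for this example, not part of PySpark.

```python
def column_groups(columns, size=3):
    """Split a list of column names into consecutive groups of at most `size`."""
    return [columns[i:i + size] for i in range(0, len(columns), size)]


def show_in_groups(df, size=3, n=5):
    """Show the first `n` rows of `df`, a few columns at a time."""
    for group in column_groups(df.columns, size):
        df.select(*group).show(n, truncate=False)


# Usage with the data frame loaded earlier:
# show_in_groups(logs)
```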



### Drop columns

In [None]:
# instead of dropping columns, you can select the ones you want to keep
logs1 = logs.select(
    *[x for x in logs.columns if x not in ["BroadcastLogID", "SequenceNO"]]
)

### Create new columns

In [None]:
# create a column showing duration in seconds

logs.select("Duration").show(5)

In [None]:
logs.select(F.col("Duration")).dtypes

In [None]:
# step 1: extract hours, minutes and seconds
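
Assuming `Duration` is a string column formatted as `HH:MM:SS.fffffff` (the `dtypes` call above should confirm the type), the three fields can be extracted by position with `Column.substr(startPos, length)`, where positions start at 1. The function name `duration_parts` and the `dur_*` aliases are hypothetical names chosen for this sketch.

```python
def duration_parts(df):
    """Return Duration plus its hour, minute, and second fields as integers.

    Assumes df["Duration"] is a string such as "02:00:00.0000000".
    """
    duration = df["Duration"]
    return df.select(
        duration,
        duration.substr(1, 2).cast("int").alias("dur_hours"),
        duration.substr(4, 2).cast("int").alias("dur_minutes"),
        duration.substr(7, 2).cast("int").alias("dur_seconds"),
    )


# Usage with the data frame loaded earlier:
# duration_parts(logs).distinct().show(5)
```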



In [None]:
# step 2. merge all fields into one



In [None]:
# create a new column for duration in seconds
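
The extracted fields can be merged into a single expression, hours × 3600 + minutes × 60 + seconds, and attached with `withColumn()`. This is a minimal sketch under the same assumption that `Duration` is a string of the form `HH:MM:SS.fffffff`; the function names and the column name `Duration_seconds` are illustrative choices.

```python
def duration_seconds_expr(df):
    """Build a column expression converting Duration (HH:MM:SS...) to seconds."""
    duration = df["Duration"]
    return (
        duration.substr(1, 2).cast("int") * 60 * 60   # hours
        + duration.substr(4, 2).cast("int") * 60      # minutes
        + duration.substr(7, 2).cast("int")           # seconds
    )


def with_duration_seconds(df):
    """Return a new data frame with an added Duration_seconds column."""
    return df.withColumn("Duration_seconds", duration_seconds_expr(df))


# Usage: assign the result, because withColumn() does not modify `logs` in place
# logs = with_duration_seconds(logs)
```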



In [None]:
# print the schema. Why don't I see the new column?
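
A likely reason is that PySpark data frames are immutable: `withColumn()` and `select()` return a new data frame and leave the original untouched, so the result has to be assigned to a name. The sketch below illustrates the effect; the function name `add_column_and_compare` and the column name `Duration_hours` are hypothetical.

```python
def add_column_and_compare(df):
    """Add a column to a copy of `df` and report which frame contains it."""
    new_df = df.withColumn(
        "Duration_hours", df["Duration"].substr(1, 2).cast("int")
    )
    return {
        "original_has_column": "Duration_hours" in df.columns,
        "new_has_column": "Duration_hours" in new_df.columns,
    }


# Usage with the data frame loaded earlier:
# add_column_and_compare(logs)
# To keep a new column, rebind the name: logs = logs.withColumn(...)
```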


![Warning](https://raw.githubusercontent.com/Suhong88/ISA460_Fall2023/main/images/Figure%204.5.png)

### Rename and reorder columns

In [None]:
# change all columns to lower case
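
Because `toDF()` accepts a full list of new column names and applies them by position, every column can be renamed in one call. The helper name `lowercase_columns` is an assumption made for this sketch.

```python
def lowercase_columns(df):
    """Return a new data frame whose column names are all lower case."""
    return df.toDF(*[name.lower() for name in df.columns])


# Usage with the data frame loaded earlier:
# logs = lowercase_columns(logs)
```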


In [None]:
# order all columns in alphabetical order


# store the result into a new dataframe
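
Reordering is a `select()` over the column names in the desired order. The sketch below uses Python's `sorted()`, which is case-sensitive (upper-case letters sort before lower-case ones); the helper name `sort_columns` is hypothetical.

```python
def sort_columns(df):
    """Return a new data frame with its columns in alphabetical order."""
    return df.select(*sorted(df.columns))


# Usage: store the result in a new data frame
# logs_sorted = sort_columns(logs)
```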


### Diagnosing a data frame with describe() and summary()

In [None]:
#logs.describe().show()

# for a data frame with many columns, describe the columns one at a time
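
Since `describe()` accepts column names, a loop over `logs.columns` keeps each output narrow enough to read. The helper name `describe_each` and its `limit` parameter are assumptions made for this sketch.

```python
def describe_each(df, limit=None):
    """Show describe() for each column separately.

    `limit` restricts the loop to the first few columns; None means all.
    """
    for name in df.columns[:limit]:
        df.describe(name).show()


# Usage with the data frame loaded earlier:
# describe_each(logs, limit=3)
```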



In [None]:
# return numerical columns
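
`logs.dtypes` returns a list of `(column name, type string)` pairs, so the numerical columns can be picked out with a list comprehension. The set of type strings below is an assumption about which Spark types should count as numerical; adjust it as needed.

```python
# Spark SQL type strings treated as numerical in this sketch
NUMERIC_TYPES = {"tinyint", "smallint", "int", "bigint", "float", "double"}


def numeric_columns(dtypes):
    """Return the names of numerical columns from a DataFrame.dtypes list."""
    return [
        name
        for name, dtype in dtypes
        if dtype in NUMERIC_TYPES or dtype.startswith("decimal")
    ]


# Usage with the data frame loaded earlier:
# numeric_columns(logs.dtypes)
```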


In [None]:
# apply describe to numerical columns


In [None]:
# apply summary to numerical columns
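
Both methods can be applied to a subset of columns by selecting those columns first. `describe()` reports count, mean, stddev, min, and max, while `summary()` also reports the 25%, 50%, and 75% percentiles by default and accepts the names of the statistics to compute. The helper names below are hypothetical, and `names` is assumed to be a list of numerical column names.

```python
def describe_numeric(df, names):
    """Return describe() statistics for the given columns."""
    return df.select(*names).describe()


def summarize_numeric(df, names, *statistics):
    """Return summary() statistics for the given columns.

    With no `statistics`, summary() computes its default set.
    """
    return df.select(*names).summary(*statistics)


# Usage, with `names` being a list of numerical column names:
# describe_numeric(logs, names).show()
# summarize_numeric(logs, names, "min", "25%", "50%", "75%", "max").show()
```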


### In-class exercise: create a new data frame, logs_clean, that contains only the columns whose names do not end with ID.

In [None]:
import os

directory = os.path.join(data_path, "broadcast_logs")

# reload the sample so the exercise starts from the original columns
logs = spark.read.csv(
    os.path.join(directory, "BroadcastLogs_2018_Q3_M8_sample.CSV"),
    sep="|",
    header=True,
    inferSchema=True,
    timestampFormat="yyyy-MM-dd",
)
logs.columns
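
One possible approach filters `logs.columns` with `str.endswith()` and selects what remains. The helper name `drop_id_columns` is hypothetical, and the check is case-sensitive, so it assumes the original mixed-case column names.

```python
def drop_id_columns(df):
    """Return a new data frame without the columns whose names end in "ID"."""
    keep = [name for name in df.columns if not name.endswith("ID")]
    return df.select(*keep)


# Usage with the data frame loaded above:
# logs_clean = drop_id_columns(logs)
# logs_clean.printSchema()
```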

### Display a list of program titles that include the word "apple", with duplicates removed.
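
One possible approach is sketched below. It assumes the titles live in a column named `ProgramTitle` and that the match should ignore case; `rlike()` takes a Java regular expression, so the `(?i)` prefix makes the pattern case-insensitive. The helper name `titles_containing` is hypothetical.

```python
def titles_containing(df, word="apple"):
    """Return the distinct program titles that contain `word`, ignoring case.

    `word` is used as a regular expression, so escape any special characters.
    """
    pattern = "(?i)" + word
    return (
        df.where(df["ProgramTitle"].rlike(pattern))
        .select("ProgramTitle")
        .distinct()
    )


# Usage with the data frame loaded earlier:
# titles_containing(logs).show(truncate=False)
```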

### Display the top 5 program titles by the number of times each was broadcast.
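
A possible approach groups the rows by title, counts each group, and sorts the counts in descending order. As a sketch, it assumes the titles live in a column named `ProgramTitle` and that each row of the log is one broadcast; the helper name `top_titles` is hypothetical.

```python
def top_titles(df, n=5):
    """Return the `n` program titles that appear most often in `df`."""
    return (
        df.groupBy("ProgramTitle")
        .count()                             # adds a column named "count"
        .orderBy("count", ascending=False)   # most frequent first
        .limit(n)
    )


# Usage with the data frame loaded earlier:
# top_titles(logs).show(truncate=False)
```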