# Lesson 16 - Working with Columns

In [0]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

## Introduction

In this lesson, we will discuss a few tools for working with columns. In particular, we will explore how to select specific columns from a DataFrame, how to perform operations on columns, how to rename columns, and how to create new columns. We will use the gapminder dataset to illustrate these concepts.

In [0]:
gm_schema = (
    'country STRING, year INTEGER, continent STRING, population INTEGER, '
    'life_exp DOUBLE, gdp_per_cap DOUBLE, gini DOUBLE'
)

gm_df = (
  spark.read
  .option('delimiter', '\t')
  .option('header', True)
  .schema(gm_schema)
  .csv('/FileStore/tables/gapminder_data.txt')
)
    
gm_df.printSchema()

We will draw a sample of rows to use in the examples presented in this notebook.

In [0]:
gm_sample = gm_df.sample(withReplacement=False, fraction=0.0003, seed=21)
gm_sample.show()
# Use gm_sample.show(truncate=False) to display values without truncation.

## The `select()` Transformation

We can create a DataFrame containing a subset of the columns from an existing DataFrame using the `select()` transformation. We indicate the columns that we would like to select by providing strings representing the names of the columns as arguments to `select()`.

In [0]:
gm_sample.select('country', 'year', 'population').show()

The `select()` transformation can also accept as an argument a list that contains the names of the columns that we wish to select. This allows us to slice the `columns` attribute to easily select some number of consecutive columns.

In [0]:
# gm_sample has the same columns as gm_df, so we slice its own column list.
gm_sample.select(gm_sample.columns[:4]).show(3)
gm_sample.select(gm_sample.columns[4:]).show(3)

We can use the wildcard character `'*'` in `select()` to select every column within a DataFrame. This might not seem useful at the moment, but we will see later on that this does have practical applications.

In [0]:
gm_sample.select('*').show()

## Column Objects

The rows of a DataFrame are represented by `Row` objects and the columns of a DataFrame are represented by `Column` objects. There are many situations when it is useful, or even necessary, to work directly with column objects. The `pyspark.sql.functions` module contains two functions named `col()` and `expr()` that can be used to create column objects.

* The `col()` function accepts a string representing the name of a column within a DataFrame. 
* The `expr()` function accepts a string representing a SQL-like expression involving one or more columns within a DataFrame.

The simplest type of expression that can be used within `expr()` is the name of a column. We will discuss more complicated expressions later.

In [0]:
from pyspark.sql.functions import col, expr

In [0]:
gini_col = col('gini')
pop_col = expr('population')

print(type(gini_col))
print(type(pop_col))

The arguments provided to `select()` can be strings representing column names, or they can be column objects associated with specific columns. This is illustrated in the cell below.

In [0]:
gm_sample.select('country', 'year', col('gini'), expr('population') ).show()

## Column Operations

Numerical operations can be performed on column objects. The result is another column object. Additionally, we can indicate operations to be performed on the elements of a column within an expression string provided to `expr()`. Both of these approaches to performing column operations are illustrated below.

In [0]:
gini_col = col('gini') / 100
pop_col = expr('population / 1000000')

print(type(gini_col))
print(type(pop_col))

Note that column objects do not actually contain values from a column. In fact, there is nothing in our definition of the column objects above that even connects them to a specific DataFrame. Column objects contain a description of how the elements of a column should be calculated rather than containing the column values themselves. An operation represented by a column object is queued when it is supplied to the `select()` method, and is not actually performed until an action is called that requires the contents of that column.

In [0]:
gm_sample.select(
    'country', 'year', 
    col('gini') / 100, 
    expr('population / 1000000') 
).show()

## Renaming Columns

Notice that Spark named the transformed columns in a way that describes the operation used to calculate their values. This results in somewhat unsightly column names. We can rename columns generated by `select()` in two ways:

* Every column object has an `alias()` method. This method accepts a string indicating the desired name for the column. 
* We can use the `AS` keyword within an expression string provided to `expr()` to set the name of the column being created.

Both of these techniques are illustrated in the example below.

In [0]:
gm_sample.select(
    'country', 'year', 
    (col('gini') / 100).alias('gini_ratio'), 
    expr('population / 1000000 AS pop_in_mil') 
).show()

## Operations Involving Multiple Columns

It is possible to perform calculations involving more than one column. This can be accomplished by performing operations directly on column objects representing the original columns, or by indicating the desired operation in an expression string provided to `expr()`. 

In the cell below, we will calculate the total GDP of each country in our sample by multiplying the per capita GDP by the population. For the sake of an example, we perform this calculation in two different ways.

In [0]:
gm_sample.select(
    '*',     
    (col('gdp_per_cap')*col('population')).alias('totalGDP1'),
    expr('gdp_per_cap * population AS totalGDP2')
).show()

## The `withColumn()` Transformation

We have seen that we can use `select()` to create new columns by performing operations on previously existing columns. Spark provides another transformation, named `withColumn()`, that can be used to add a column to a DataFrame. This method expects two arguments. The first argument should be a string representing the name of the new column. The second argument should be a column object, such as one created with `col()` or `expr()`, that describes how to calculate the values for the new column from previously existing columns. Note that this method is a transformation that returns a DataFrame that includes the new column as well as all of the original columns.

In [0]:
gm_sample.withColumn('totalGDP', expr('gdp_per_cap * population') ).show()

It is perhaps a bit easier to use `withColumn()` if we are creating only a single column. But if we are creating multiple new columns, or want to also remove some of the previous columns, then `select()` is typically a better tool to use.