In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container {width : 99% !important;}</style>"))
display(HTML("<style>.output_result {width : 99% !important;}</style>"))
display(HTML("<style>.jp-Notebook {--jp-notebook-max-width: 99%;}/style>"))

# Introduction

## Machine Learning & Spark
1. Machine Learning & Spark

Hi! Welcome to the course on Machine Learning with Apache Spark, in which you will learn how to build Machine Learning models on large data sets using distributed computing techniques. Let's start with some fundamental concepts.

2. Building the perfect waffle (an analogy)

Suppose you wanted to teach a computer how to make waffles. You could find a good recipe and then give the computer explicit instructions about ingredients and proportions. Alternatively, you could present the computer with a selection of different waffle recipes and let it figure out the ingredients and proportions for the best recipe. The second approach is how Machine Learning works: the computer literally learns from examples.

3. Regression & classification

Machine Learning problems are generally less esoteric than finding the perfect waffle recipe. The most common problems apply either Regression or Classification. A regression model learns to predict a number. For example, when making waffles, how much flour should be used for a particular amount of sugar? A classification model, on the other hand, predicts a discrete or categorical value. For example, is a recipe calling for a particular amount of sugar and salt more likely to be for waffles or cupcakes?

4. Data in RAM

The performance of a Machine Learning model depends on data. In general, more data is a good thing. If an algorithm is able to train on a larger set of data, then its ability to generalize to new data will inevitably improve. However, there are some practical constraints. If the data can fit entirely into RAM then the algorithm can operate efficiently. What happens when those data no longer fit into memory?

5. Data exceeds RAM

The computer will start to use *virtual memory* and data will be *paged* back and forth between RAM and disk. Relative to RAM access, retrieving data from disk is slow. As the size of the data grows, paging becomes more intense and the computer begins to spend more and more time waiting for data. Performance plummets.

6. Data distributed across a cluster

How then do we deal with truly large datasets? One option is to distribute the problem across multiple computers in a cluster. Rather than trying to handle a large dataset on a single machine, it's divided up into partitions which are processed separately. Ideally each data partition can fit into RAM on a single computer in the cluster. This is the approach used by Spark.

7. What is Spark?

Spark is a general purpose framework for cluster computing. It is popular for two main reasons: 1. it's generally much faster than other Big Data technologies like Hadoop, because it does most processing in memory and 2. it has a developer-friendly interface which hides much of the complexity of distributed computing.

8. Components: nodes

Let's review the components of a Spark cluster. The cluster itself consists of one or more nodes. Each node is a computer with CPU, RAM and physical storage.

9. Components: cluster manager

A cluster manager allocates resources and coordinates activity across the cluster.

10. Components: driver

Every application running on the Spark cluster has a driver program. Using the Spark API, the driver communicates with the cluster manager, which in turn distributes work to the nodes.

11. Components: executors

On each node Spark launches an executor process which persists for the duration of the application. Work is divided up into tasks, which are simply units of computation. The executors run tasks in multiple threads across the cores in a node. When working with Spark you normally don't need to worry *too* much about the details of the cluster. Spark sets up all of that infrastructure for you and handles all interactions within the cluster. However, it's still useful to know how it works under the hood.

12. Onward!

You now have a basic understanding of the principles of Machine Learning and distributed computing with Spark. Next we'll learn how to connect to a Spark cluster.

### Characteristics of Spark
Spark is currently the most popular technology for processing large quantities of data. Not only is it able to handle enormous data volumes, but it does so very efficiently too! Also, unlike some other distributed computing technologies, developing with Spark is a pleasure.

Which of these describes Spark?

Answer the question


- Spark is a framework for cluster computing.

- Spark does most processing in memory.

- Spark has a high-level API, which conceals a lot of complexity.

- **All of the above.**

### Components in a Spark Cluster
Spark is a distributed computing platform. It achieves efficiency by distributing data and computation across a cluster of computers.

A Spark cluster consists of a number of hardware and software components which work together.

Which of these is not part of a Spark cluster?

Answer the question

- One or more nodes

- A cluster manager

- **A load balancer**

- Executors

## Connecting to Spark
1. Connecting to Spark

The previous lesson was high level overviews of Machine Learning and Spark. In this lesson you'll review the process of connecting to Spark.

2. Interacting with Spark

The connection with Spark is established by the driver, which can be written in either Java, Scala, Python or R. Each of these languages has advantages and disadvantages. Java is relatively verbose, requiring a lot of code to accomplish even simple tasks. By contrast, Scala, Python and R, are high-level languages which can accomplish much with only a small amount of code. They also offer a REPL, or Read-Evaluate-Print loop, which is crucial for interactive development. You'll be using Python.

3. Importing pyspark

Python doesn't talk natively to Spark, so we'll kick off by importing the pyspark module, which makes Spark functionality available in the Python interpreter. Spark is under vigorous development. Because the interface is evolving it's important to know what version you're working with. We'll be using version 2.4.1, which was released in March 2019.

4. Sub-modules

In addition to the main pyspark module, there are a few sub-modules which implement different aspects of the Spark interface. There are two versions of Spark Machine Learning: mllib, which uses an unstructured representation of data in RDDs and has been deprecated, and ml which is based on a structured, tabular representation of data in DataFrames. We'll be using the latter.

5. Spark URL

With the pyspark module loaded, you are able to connect to Spark. The next thing you need to do is tell Spark where the cluster is located. Here there are two options. You can either connect to a remote cluster, in which case you need to specify a Spark URL, which gives the network location of the cluster's master node. The URL is composed of an IP address or DNS name and a port number. The default port for Spark is 7077, but this must still be explicitly specified. When you're figuring out how Spark works, the infrastructure of a distributed cluster can get in the way. That's why it's useful to create a local cluster, where everything happens on a single computer. This is the setup that you're going to use throughout this course. For a local cluster, you need only specify "local" and, optionally, the number of cores to use. By default, a local cluster will run on a single core. Alternatively, you can give a specific number of cores or simply use the wildcard to choose all available cores.

6. Creating a SparkSession

You connect to Spark by creating a SparkSession object. The SparkSession class is found in the pyspark.sql sub-module. You specify the location of the cluster using the master() method. Optionally you can assign a name to the application using the appName() method. Finally you call the getOrCreate() method, which will either create a new session object or return an existing object. Once the session has been created you are able to interact with Spark. Finally, although it's possible for multiple SparkSessions to co-exist, it's good practice to stop the SparkSession when you're done.

7. Let's connect to Spark!

Great! Let's connect to Spark!

### Location of Spark master
Which of the following is not a valid way to specify the location of a Spark cluster?

Answer the question

- spark://13.59.151.161:7077

- spark://ec2-18-188-22-23.us-east-2.compute.amazonaws.com:7077

- **spark://18.188.22.23**

- local

- local[4]

- local[*]

### Creating a SparkSession
In this exercise, you'll spin up a local Spark cluster using all available cores. The cluster will be accessible via a `SparkSession` object.

The `SparkSession` class has a builder attribute, which is an instance of the Builder class. The Builder class exposes three important methods that let you:

- specify the location of the master node;
- name the application (optional); and
- retrieve an existing `SparkSession` or, if there is none, create a new one.
- The `SparkSession` class has a version attribute that gives the version of Spark. Note: The version can also be accessed via the `__version__` attribute on the `pyspark` module.

Find out more about `SparkSession` [here](https://spark.apache.org/docs/2.3.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession).

Once you are finished with the cluster, it's a good idea to shut it down, which will free up its resources, making them available for other processes.

**Note**:: You might find it useful to review the slides from the lessons in the Slides panel next to the *IPython Shell*.

**Instructions**

- Import the `SparkSession` class from `pyspark.sql`.
- Create a `SparkSession` object connected to a local cluster. Use all available cores. Name the application `'test'`.
- Use the version attribute on the `SparkSession` object to retrieve the version of Spark running on the cluster. **Note**: The version might be different from the one that's used in the presentation (it gets updated from time to time).
- Shut down the cluster.

In [2]:
# Import the SparkSession class
from pyspark.sql import SparkSession

# Create SparkSession object
spark = SparkSession.builder \
                    .master('local[*]') \
                    .appName('test') \
                    .getOrCreate()

# What version of Spark?
print(spark.version)

# Terminate the cluster
spark.stop()

3.4.0


## Loading Data
1. Loading Data

In this lesson you'll look at how to read data into Spark.

3. DataFrames: A refresher

Spark represents tabular data using the DataFrame class. The data are captured as rows (or "records"), each of which is broken down into one or more columns (or "fields"). Every column has a name and a specific data type. Some selected methods and attributes of the DataFrame class are listed here. The count() method gives the number of rows. The show() method will display a subset of rows. The printSchema() method and the dtypes attribute give different views on column types. This is really scratching the surface of what's possible with a DataFrame. You can find out more by consulting the extensive documentation.

4. CSV data for cars

CSV is a common format for storing tabular data. For illustration we'll be using a CSV file with characteristics for a selection of motor vehicles. Each line in a CSV file is a new record and within each record, fields are separated by a delimiter character, which is normally a comma. The first line is an optional header record which gives column names.

5. Reading data from CSV

Our session object has a "read" attribute which, in turn, has a csv() method which reads data from a CSV file and returns a DataFrame. The csv() method has one mandatory argument, the path to the CSV file. There are a number of optional arguments. We'll take a quick look at some of the most important ones. The header argument specifies whether or not there is a header record. The sep argument gives the field separator, which is a comma by default. There are two arguments which pertain to column data types, schema and inferSchema. Finally, the nullValue argument gives the placeholder used to indicate missing data. Let's take a look at the data we've just loaded.

6. Peek at the data

Using the show() method we can take a look at a slice of the DataFrame. The csv() method has split the data into rows and columns and picked up the column names from the header record. Looks great, doesn't it? Unfortunately there's a small snag. Before we unravel that snag, it's important to note that the first value in the cylinder column is not a number. It's the string "NA" which indicates missing data.

7. Check column types

If you check the column data types then you'll find that they are all strings. That doesn't make sense since the last six columns are clearly numbers! However, this is the expected behavior: the csv() method treats all columns as strings by default. You need to do a little more work to get the correct column types. There are two ways that you can do this: infer the column types from the data or manually specify the types.

8. Inferring column types from data

It's possible to reasonably deduce the column types by setting the inferSchema argument to True. There is a price to pay though: Spark needs to make an extra pass over the data to figure out the column types before reading the data. If the data file is big then this will increase load time notably. Using this approach all of the column types are correctly identified except for cylinder. Why? The first value in this column is "NA", so Spark thinks that the column contains strings.

9. Dealing with missing data

Missing data in CSV files are normally represented by a placeholder like the "NA" string. You can use the nullValue argument to specify the placeholder. It's always a good idea to explicitly define the missing data placeholder. The nullValue argument is case sensitive, so it's important to provide it in exactly the same form as it appears in the data file.

10. Specify column types

If inferring column type is not successful then you have the option of specifying the type of each column in an explicit schema. This also makes it possible to choose alternative column names.

11. Final cars data

This is what the final cars data look like. Note that the missing value at the top of the cylinders column is indicated by the special null constant.

12. Let's load some data!

You're ready to use what you've learned to load data from CSV files!

In [3]:
from pyspark import SparkContext
sc = SparkContext()
sc

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession(sc).builder\
.master("local[2]")\
.appName("MLSpark")\
.getOrCreate()
spark

### Loading flights data
In this exercise, you're going to load some airline flight data from a CSV file. To ensure that the exercise runs quickly these data have been trimmed down to only 50 000 records. You can get a larger dataset in the same format [here](https://assets.datacamp.com/production/repositories/3918/datasets/e1c1a03124fb2199743429e9b7927df18da3eacf/flights-larger.csv).

Notes on CSV format:

- fields are separated by a comma (this is the default separator) and
- missing data are denoted by the string 'NA'.

**Data dictionary**:

- `mon` — month (integer between 1 and 12)
- `dom` — day of month (integer between 1 and 31)
- `dow` — day of week (integer; 1 = Monday and 7 = Sunday)
- `carrier` — carrier (IATA code)
- `flight` — flight number
- `org` — origin airport (IATA code)
- `mile` — distance (miles)
- `depart` — departure time (decimal hour)
- `duration` — expected duration (minutes)
- `delay` — delay (minutes)

`pyspark` has been imported for you and the session has been initialized.

Note: The data have been aggressively down-sampled.

**Instructions**

- Read data from a CSV file called `'flights.csv'`. Assign data types to columns automatically. Deal with missing data.
- How many records are in the data?
- Take a look at the first five records.
- What data types have been assigned to the columns? Do these look correct?

In [5]:
# Read data from CSV file
flights = spark.read.csv('flights.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights.count())

# View the first five records
flights.show(5)

# Check column data types
print(flights.dtypes)

The data contain 50000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]


In [6]:
# Read data from CSV file
flights_full = spark.read.csv('flights-larger.csv',
                         sep=',',
                         header=True,
                         inferSchema=True,
                         nullValue='NA')

# Get number of records
print("The data contain %d records." % flights_full.count())

# View the first five records
flights_full.show(5)

# Check column data types
print(flights_full.dtypes)

The data contain 275000 records.
+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 10| 10|  1|     OO|  5836|ORD| 157|  8.18|      51|   27|
|  1|  4|  1|     OO|  5866|ORD| 466|  15.5|     102| null|
| 11| 22|  1|     OO|  6016|ORD| 738|  7.17|     127|  -19|
|  2| 14|  5|     B6|   199|JFK|2248| 21.17|     365|   60|
|  5| 25|  3|     WN|  1675|SJC| 386| 12.92|      85|   22|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows

[('mon', 'int'), ('dom', 'int'), ('dow', 'int'), ('carrier', 'string'), ('flight', 'int'), ('org', 'string'), ('mile', 'int'), ('depart', 'double'), ('duration', 'int'), ('delay', 'int')]


### Loading SMS spam data
You've seen that it's possible to infer data types directly from the data. Sometimes it's convenient to have direct control over the column types. You do this by defining an explicit schema.

The file `sms.csv` contains a selection of SMS messages which have been classified as either `'spam'` or `'ham'`. These data have been adapted from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/sms+spam+collection). There are a total of 5574 SMS, of which 747 have been labeled as spam.

**Notes on CSV format**:

- no header record and
- fields are separated by a semicolon (this is not the default separator).

**Data dictionary**:

- `id` — record identifier
- `text` — content of SMS message
- `label` — spam or ham (integer; 0 = ham and 1 = spam)

**Instructions**

- Specify the data schema, giving columns names (`"id"`, `"text"`, and `"label"`) and column types.
- Read data from a delimited file called `"sms.csv"`.
- Print the schema for the resulting DataFrame.

In [7]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify column names and types
schema = StructType([
    StructField("id", IntegerType()),
    StructField("text", StringType()),
    StructField("label", IntegerType())
])

# Load data from a delimited file
sms = spark.read.csv('sms.csv', sep=';', header=False, schema=schema)

# Print schema of DataFrame
sms.printSchema()

root
 |-- id: integer (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = true)



# Classification

## Data Preparation
1. Data Preparation

In this lesson you are going to learn how to prepare data for building a Machine Learning model.

2. Do you need all of those columns?

You'll be working with the cars data again. This is what the data look like at present. There are columns for the maker and model, the origin (either USA or non-USA), the type, number of cylinders, engine size, weight, length, RPM and fuel consumption. The models that you'll be building will depend on the physical characteristics of the cars rather than the model names or manufacturers, so you'll remove the corresponding columns from the data.

3. Dropping columns

There are two approaches to doing this: either you can drop() the columns that you don't want or you can select() the fields which you do want to retain. Either way, the resulting data does not include those columns.

4. Filtering out missing data

Earlier you saw that there is a missing value in the cylinders column. Let's check to see how many other missing values there are. You'll use the filter() method and provide a logical predicate using SQL syntax which identifies NULL values. Then the count() method tells you how many records there are remaining. Just one. In this case it makes sense to simply remove the record with the missing value. There are a couple of ways that you could to do this. You could use the filter() method again with a different predicate. Or you could take a more aggressive approach and use the dropna() method to drop all records with missing values in any column. However, this should be done with care because it could result in the loss of a lot of otherwise useful data. You've now stripped down the data to what's needed to build a model.

5. Mutating columns

At present the weight and length columns are in units of pounds and inches respectively. You'll use the withColumn() method to create a new mass column in units of kilograms. The round() function is used to limit the precision of the result. You can also use the withColumn() method to replace the existing length column with values in meters. You now have mass and length in metric units.

6. Indexing categorical data

The type column consists of strings which represent six categories of vehicle type. You'll need to transform those strings into numbers. You do this using an instance of the StringIndexer class. In the constructor you provide the name of the string input column and a name for the new output column to be created. The indexer is first fit to the data, creating a StringIndexerModel. During the fitting process the distinct string values are identified and an index is assigned to each value. The model is then used to transform the data, creating a new column with the index values. By default the index values are assigned according to the descending relative frequency of each of the string values. Midsize is most common, so it gets an index of zero. Small is next most common, so its index is one. And so on. It's possible to choose different strategies for assigning index values by specifying the stringOrderType argument. Rather than using frequency of occurrence, strings can be ordered alphabetically. It's also possible to choose between ascending and descending order.

7. Indexing country of origin

You'll be building a classifier to predict whether or not a car was manufactured in the USA. So the origin column also needs to be converted from strings into numbers.

8. Assembling columns

The final step in preparing the cars data is to consolidate the various input columns into a single column. This is necessary because the Machine Learning algorithms in Spark operate on a single vector of predictors, although each element in that vector may consist of multiple values. To illustrate the process you'll start with just a pair of features, cylinders and size. First you create an instance of the VectorAssembler class, providing it with the names of the columns that you want to consolidate and the name of the new output column. The assembler is then used to transform the data. Taking a look at the relevant columns you see that the new "features" column consists of values from the cylinders and size columns consolidated into a vector. Ultimately you are going to assemble all of the predictors into a single column.

9. Let's practice!

Let's try out what we have learned on the SMS and flights data.

### Removing columns and rows
You previously loaded airline flight data from a CSV file. You're going to develop a model which will predict whether or not a given flight will be delayed.

In this exercise, you need to trim those data down by:

- removing an uninformative column and
- removing rows that do not have information about whether or not a flight was delayed.

The data are available as `flights`.

**Note**:: You might find it helpful to revise the slides from the lessons in the Slides panel next to the IPython Shell.

**Instructions**

- Remove the `flight` column.
- Find out how many records have missing values in the `delay` column.
- Remove records with missing values in the `delay` column.
- Remove records with missing values in any column and get the number of remaining rows.

In [8]:
# Remove the 'flight' column
flights_drop_column = flights.drop('flight')

# Number of records with missing 'delay' values
flights_drop_column.filter('delay IS NULL').count()

# Remove records with missing 'delay' values
flights_valid_delay = flights_drop_column.filter('delay IS NOT NULL')

# Remove records with missing values in any column and get the number of remaining rows
flights_none_missing = flights_valid_delay.dropna()
print(flights_none_missing.count())

47022


In [9]:
print(flights_drop_column.filter('delay IS NULL').count())

2978


### Column manipulation
The Federal Aviation Administration (FAA) considers a flight to be "delayed" when it arrives 15 minutes or more after its scheduled time.

The next step of preparing the flight data has two parts:

- convert the units of distance, replacing the `mile` column with a `km` column; and
- create a Boolean column indicating whether or not a flight was delayed.

**Instructions**

- Import a function which will allow you to round a number to a specific number of decimal places.
- Derive a new `km` column from the `mile` column, rounding to zero decimal places. One mile is 1.60934 km.
- Remove the `mile` column.
- Create a label column with a value of `1` indicating the delay was 15 minutes or more and `0` otherwise. Think carefully about the logical condition.

In [10]:
flights.show(5)

+---+---+---+-------+------+---+----+------+--------+-----+
|mon|dom|dow|carrier|flight|org|mile|depart|duration|delay|
+---+---+---+-------+------+---+----+------+--------+-----+
| 11| 20|  6|     US|    19|JFK|2153|  9.48|     351| null|
|  0| 22|  2|     UA|  1107|ORD| 316| 16.33|      82|   30|
|  2| 20|  4|     UA|   226|SFO| 337|  6.17|      82|   -8|
|  9| 13|  1|     AA|   419|ORD|1236| 10.33|     195|   -5|
|  4|  2|  5|     AA|   325|ORD| 258|  8.92|      65| null|
+---+---+---+-------+------+---+----+------+--------+-----+
only showing top 5 rows



In [11]:
flights_backup = flights
flights = flights_none_missing

In [12]:
# Import the required function
from pyspark.sql.functions import round

# Convert 'mile' to 'km' and drop 'mile' column (1 mile is equivalent to 1.60934 km)
flights_km = flights.withColumn('km', round(flights['mile'] * 1.60934, 0)) \
                    .drop('mile')

# Create 'label' column indicating whether flight delayed (1) or not (0)
flights_km = flights_km.withColumn('label', (flights_km.delay >= 15).cast('integer'))

# Check first five records
flights_km.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|
+---+---+---+-------+---+------+--------+-----+------+-----+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|
+---+---+---+-------+---+------+--------+-----+------+-----+
only showing top 5 rows



### Categorical columns
In the flights data there are two columns, `carrier` and `org`, which hold categorical data. You need to transform those columns into indexed numerical values.

**Instructions**

- Import the appropriate class and create an indexer object to transform the `carrier` column from a string to an numeric index.
- Prepare the indexer object on the flight data.
- Use the prepared indexer to create the numeric index column.
- Repeat the process for the org column.

In [13]:
flights_backup2 = flights
flights = flights_km

In [14]:
from pyspark.ml.feature import StringIndexer

# Create an indexer
indexer = StringIndexer(inputCol='carrier', outputCol='carrier_idx')

# Indexer identifies categories in the data
indexer_model = indexer.fit(flights)

# Indexer creates a new column with numeric index values
flights_indexed = indexer_model.transform(flights)

# Repeat the process for the other categorical feature
flights_indexed = StringIndexer(inputCol='org', outputCol='org_idx').fit(flights_indexed).transform(flights_indexed)
flights_indexed.show(5)



+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 5 rows



### Assembling columns
The final stage of data preparation is to consolidate all of the predictor columns into a single column.

An updated version of the `flights` data, which takes into account all of the changes from the previous few exercises, has the following predictor columns:

`mon`, `dom` and `dow`
- `carrier_idx` (indexed value from carrier)
- `org_idx` (indexed value from org)
- `km`
- `depart`
- `duration`

Note: The `truncate=False` argument to the `show()` method prevents data being truncated in the output.

**Instructions**

- Import the class which will assemble the predictors.
- Create an assembler object that will allow you to merge the predictors columns into a single column.
- Use the assembler to generate a new consolidated column.

In [15]:
flights_backup3 = flights
flights = flights_indexed
flights.show(3)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 3 rows



In [16]:
# Import the necessary class
from pyspark.ml.feature import VectorAssembler

# Create an assembler object
assembler = VectorAssembler(inputCols=['mon', 'dom', 'dow',
    'carrier_idx', 'org_idx', 'km', 'depart', 'duration'
], outputCol='features')

# Consolidate predictor columns
flights_assembled = assembler.transform(flights)

# Check the resulting column
flights_assembled.select('features', 'delay').show(5, truncate=False)

+-----------------------------------------+-----+
|features                                 |delay|
+-----------------------------------------+-----+
|[0.0,22.0,2.0,0.0,0.0,509.0,16.33,82.0]  |30   |
|[2.0,20.0,4.0,0.0,1.0,542.0,6.17,82.0]   |-8   |
|[9.0,13.0,1.0,1.0,0.0,1989.0,10.33,195.0]|-5   |
|[5.0,2.0,1.0,0.0,1.0,885.0,7.98,102.0]   |2    |
|[7.0,2.0,6.0,1.0,0.0,1180.0,10.83,135.0] |54   |
+-----------------------------------------+-----+
only showing top 5 rows



## Decision Tree
1. Decision Tree

Your first Machine Learning model will be a Decision Tree. This is probably the most intuitive model, so it seems like a good place to start.

2. Anatomy of a Decision Tree: Root node

A Decision Tree is constructed using an algorithm called "Recursive Partitioning". Consider a hypothetical example in which you build a Decision Tree to divide data into two classes, green and blue. You start by putting all of the records into the root node. Suppose that there are more green records than blue, in which case this node will be labelled "green". Now from amongst the predictors in the data you need to choose the one that will result in the most informative split of the data into two groups. Ideally you want the groups to be as homogeneous (or "pure") as possible: one should be mostly green and the other should be mostly blue.

3. Anatomy of a Decision Tree: First split

Once you have identified the most informative predictor, you split the data into two sets, labeled "green" or "blue" according to the dominant class. And this is where the recursion kicks in: you then apply exactly the same procedure on each of the child nodes, selecting the most informative predictor and splitting again.

4. Anatomy of a Decision Tree: Second split

So, for example, the green node on the left could be split again into two groups.

5. Anatomy of a Decision Tree: Third split

And the resulting green node could once again be split. The depth of each branch of the tree need not be the same. There are a variety of stopping criteria which can cause splitting to stop along a branch. For example, if the number of records in a node falls below a threshold or the purity of a node is above a threshold, then you might stop splitting. Once you have built the Decision Tree you can use it to make predictions for new data by following the splits from the root node along to the tip of a branch. The label for the final node would then be the prediction for the new data.

6. Classifying cars

Let's make this more concrete by looking at the cars data. You've transformed the country of origin column into a numeric index called 'label', with zero corresponding to cars manufactured in the USA and one for everything else. The remaining columns have all been consolidated into a column called 'features'. You want to build a Decision Tree which will use "features" to predict "label".

7. Split train/test

An important aspect of building a Machine Learning model is being able to assess how well it works. In order to do this we use the randomSplit() method to randomly split our data into two sets, a training set and a testing set. The proportions may vary, but generally you're looking at something like an 80:20 split, which means that the training set ends up having around 4 times as many records as the testing set.

8. Build a Decision Tree model

Finally the moment has come, you're going to build a Decision Tree. You start by creating a DecisionTreeClassifier() object. The next step is to fit the model to the training data by calling the fit() method.

9. Evaluating

Now that you've trained the model you can assess how effective it is by making predictions on the test set and comparing the predictions to the known values. The transform() method adds new columns to the DataFrame. The prediction column gives the class assigned by the model. You can compare this directly to the known labels in the testing data. Although the model gets the first example wrong, it's correct for the following four examples. There's also a probability column which gives the probabilities assigned to each of the outcome classes. For the first example, the model predicts that the outcome is 0 with probability 96%.

10. Confusion matrix

A good way to understand the performance of a model is to create a confusion matrix which gives a breakdown of the model predictions versus the known labels. The confusion matrix consists of four counts which are labelled as follows: - "positive" indicates a prediction of 1, while - "negative" indicates a prediction of 0 and - "true" corresponds to a correct prediction, while - "false" designates an incorrect prediction. In this case the true positives and true negatives dominate but the model still makes a number of incorrect predictions. These counts can be used to calculate the accuracy, which is the proportion of correct predictions. For our model the accuracy is 74%.

11. Let's build Decision Trees!

So, now that you know how to build a Decision Tree model with Spark, you can try that out on the flight data.

### Train/test split
To objectively assess a Machine Learning model you need to be able to test it on an independent set of data. You can't use the same data that you used to train the model: of course the model will perform (relatively) well on those data!

You will split the data into two components:

- training data (used to train the model) and
- testing data (used to test the model).

**Note**: From here on you'll be working with a smaller subset of the flights data, which just makes the exercises run more quickly.

**Instructions**

- Randomly split the `flights` data into two sets with 80:20 proportions. For repeatability set a random number seed of 43 for the split.
- Check that the training data has roughly 80% of the records from the original data.

In [17]:
flights_backup3 = flights
flights = flights_assembled

In [18]:
# Split into training and testing sets in a 80:20 ratio
flights_train, flights_test = flights.randomSplit([0.8,0.2], seed = 43)

# Check that training set has around 80% of records
training_ratio = flights_train.count() / flights.count()
print(training_ratio)

0.8025392369529156


### Build a Decision Tree
Now that you've split the flights data into training and testing sets, you can use the training set to fit a Decision Tree model.

The data are available as `flights_train` and `flights_test`.

**NOTE**: It will take a few seconds for the model to train… please be patient!

**Instructions**

- Import the class for creating a Decision Tree classifier.
- Create a classifier object and fit it to the training data.
- Make predictions for the testing data and take a look at the predictions.

In [19]:
# Import the Decision Tree Classifier class
from pyspark.ml.classification import DecisionTreeClassifier

# Create a classifier object and fit to the training data
tree = DecisionTreeClassifier()
tree_model = tree.fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
prediction = tree_model.transform(flights_test)
prediction.select('label', 'prediction', 'probability').show(5, False)

+-----+----------+---------------------------------------+
|label|prediction|probability                            |
+-----+----------+---------------------------------------+
|1    |0.0       |[0.5319789315274642,0.4680210684725357]|
|1    |0.0       |[0.5319789315274642,0.4680210684725357]|
|0    |1.0       |[0.3554263565891473,0.6445736434108527]|
|1    |1.0       |[0.3554263565891473,0.6445736434108527]|
|1    |1.0       |[0.3554263565891473,0.6445736434108527]|
+-----+----------+---------------------------------------+
only showing top 5 rows



### Evaluate the Decision Tree
You can assess the quality of your model by evaluating how well it performs on the testing data. Because the model was not trained on these data, this represents an objective assessment of the model.

A confusion matrix gives a useful breakdown of predictions versus known values. It has four cells which represent the counts of:

- True Negatives (TN) — model predicts negative outcome & known outcome is negative
- True Positives (TP) — model predicts positive outcome & known outcome is positive
- False Negatives (FN) — model predicts negative outcome but known outcome is positive
- False Positives (FP) — model predicts positive outcome but known outcome is negative.
These counts (`TN`, `TP`, `FN` and `FP`) should sum to the number of records in the testing data, which is only a subset of the flights data. You can compare to the number of records in the tests data, which is `flights_test.count()`.

**Note**: These predictions are made on the testing data, so the counts are smaller than they would have been for predictions on the training data.

**Instructions**

- Create a confusion matrix by counting the combinations of label and prediction. Display the result.
- Count the number of True Negatives, True Positives, False Negatives and False Positives.

In [20]:
# Create a confusion matrix
prediction.groupBy('label', 'prediction').count().show()

# Calculate the elements of the confusion matrix
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label <> prediction').count()
FP = prediction.filter('prediction = 1 AND label <> prediction').count()

# Accuracy measures the proportion of correct predictions
accuracy = (TP + TN)/ flights_test.count()
print(accuracy)

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0| 1231|
|    0|       0.0| 2404|
|    1|       1.0| 3532|
|    0|       1.0| 2118|
+-----+----------+-----+

0.6393107162089392


In [21]:
print(TP, TN, FP, FN)

3532 2404 2118 1231


## Logistic Regression
1. Logistic Regression

You've learned to build a Decision Tree. But it's good to have options. Logistic Regression is another commonly used classification model.

2. Logistic Curve

It uses a logistic function to model a binary target, where the target states are usually denoted by 1 and 0 or TRUE and FALSE. The maths of the model are outside the scope of this course, but this is what the logistic function looks like. For a Logistic Regression model the x-axis is a linear combination of predictor variables and the y-axis is the output of the model. Since the value of the logistic function is a number between zero and one, it's often thought of as a probability. In order to translate this number into one or other of the target states it's compared to a threshold, which is normally set at one half.

3. Logistic Curve

If the number is above the threshold then the predicted state is one.

4. Logistic Curve

Conversely, if it's below the threshold then the predicted state is zero. The model derives coefficients for each of the numerical predictors. Those coefficients might...

5. Logistic Curve

shift the curve to the right...

6. Logistic Curve

or to the left. They might make the transition between states...

7. Logistic Curve

more gradual...

8. Logistic Curve

or more rapid. These characteristics are all extracted from the training data and will vary from one set of data to another.

9. Cars revisited

Let's make this more concrete by returning to the cars data. You'll focus on the numerical predictors for the moment and return to categorical predictors later on. As before you prepare the data by consolidating the predictors into a single column and then randomly splitting the data into training and testing sets.

10. Build a Logistic Regression model

To build a Logistic Regression model you first need to import the associated class and then create a classifier object. This is then fit to the training data using the fit() method.

11. Predictions

With a trained model you are able to make predictions on the testing data. As you saw with the Decision Tree, the transform() method adds the prediction and probability columns. The probability column gives the predicted probability of each class, while the prediction column reflects the predicted label, which is derived from the probabilities by applying the threshold mentioned earlier.

12. Precision and recall

You can assess the quality of the predictions by forming a confusion matrix. The quantities in the cells of the matrix can then be used to form some informative ratios. Recall that a positive prediction indicates that a car is manufactured outside of the USA and that predictions are considered to be true or false depending on whether they are correct or not. Precision is the proportion of positive predictions which are correct. For your model, two thirds of predictions for cars manufactured outside of the USA are correct. Recall is the proportion of positive targets which are correctly predicted. Your model also identifies 80% of cars which are actually manufactured outside of the USA. Bear in mind that these metrics are based on a relatively small testing set.

13. Weighted metrics

Another way of looking at these ratios is to weight them across the positive and negative predictions. You can do this by creating an evaluator object and then calling the evaluate() method. This method accepts an argument which specifies the required metric. It's possible to request the weighted precision and recall as well as the overall accuracy. It's also possible to get the F1 metric, the harmonic mean of precision and recall, which is generally more robust than the accuracy. All of these metrics have assumed a threshold of one half. What happens if you vary that threshold?

14. ROC and AUC

A threshold is used to decide whether the number returned by the Logistic Regression model translates into either the positive or the negative class. By default that threshold is set at a half. However, this is not the only choice. Choosing a larger or smaller value for the threshold will affect the performance of the model. The ROC curve plots the true positive rate versus the false positive rate as the threshold increases from zero (top right) to one (bottom left). The AUC summarizes the ROC curve in a single number. It's literally the area under the ROC curve. AUC indicates how well a model performs across all values of the threshold. An ideal model, that performs perfectly regardless of the threshold, would have AUC of 1. In an exercise we'll see how to use another evaluator to calculate the AUC.

15. Let's do Logistic Regression!

You now know how to build a Logistic Regression model and assess the performance of that model using various metrics. Let's give this a try!

### Build a Logistic Regression model
You've already built a Decision Tree model using the flights data. Now you're going to create a Logistic Regression model on the same data.

The objective is to predict whether a flight is likely to be delayed by at least 15 minutes (label `1`) or not (label `0`).

Although you have a variety of predictors at your disposal, you'll only use the `mon`, `depart` and `duration` columns for the moment. These are numerical features which can immediately be used for a Logistic Regression model. You'll need to do a little more work before you can include categorical features. Stay tuned!

The data have been split into training and testing sets and are available as `flights_train` and `flights_test`.

**Instructions**

- Import the class for creating a Logistic Regression classifier.
- Create a classifier object and train it on the training data.
- Make predictions for the testing data and create a confusion matrix.

In [22]:
flights = flights_backup3
flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 5 rows



In [23]:
flights_backup4 = flights

# Create an assembler object
assembler = VectorAssembler(inputCols=['mon','depart', 'duration'], outputCol='features')


from pyspark.sql.functions import when
flights_train, flights_test = assembler.transform(flights.select('mon','depart', 'duration','delay').\
withColumn('label' , when(flights.delay > 15 , 1).otherwise(0)).\
drop('delay')).\
randomSplit([0.8,0.2], seed = 42)

In [24]:
flights_train.show(5)

+---+------+--------+-----+----------------+
|mon|depart|duration|label|        features|
+---+------+--------+-----+----------------+
|  0|  0.25|     308|    0|[0.0,0.25,308.0]|
|  0|  0.25|     308|    0|[0.0,0.25,308.0]|
|  0|  0.25|     308|    0|[0.0,0.25,308.0]|
|  0|  0.25|     308|    1|[0.0,0.25,308.0]|
|  0|  0.25|     308|    1|[0.0,0.25,308.0]|
+---+------+--------+-----+----------------+
only showing top 5 rows



In [25]:
# Import the logistic regression class
from pyspark.ml.classification import LogisticRegression

# Create a classifier object and train on training data
logistic = LogisticRegression().fit(flights_train)

# Create predictions for the testing data and show confusion matrix
prediction = logistic.transform(flights_test)
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0| 1830|
|    0|       0.0| 2670|
|    1|       1.0| 2931|
|    0|       1.0| 1990|
+-----+----------+-----+



### Evaluate the Logistic Regression model
Accuracy is generally not a very reliable metric because it can be biased by the most common target class.

There are two other useful metrics:

- *precision* and
- *recall*.
Check the slides for this lesson to get the relevant expressions.

Precision is the proportion of positive predictions which are correct. For all flights which are predicted to be delayed, what proportion is actually delayed?

Recall is the proportion of positives outcomes which are correctly predicted. For all delayed flights, what proportion is correctly predicted by the model?

The precision and recall are generally formulated in terms of the positive target class. But it's also possible to calculate weighted versions of these metrics which look at both target classes.

The components of the confusion matrix are available as `TN`, `TP`, `FN` and `FP`, as well as the object `prediction`.

**Instructions**

- Find the precision and recall.
- Create a multi-class evaluator and evaluate weighted precision.
- Create a binary evaluator and evaluate `AUC` using the `"areaUnderROC"` metric.

In [26]:
TN = prediction.filter('prediction = 0 AND label = prediction').count()
TP = prediction.filter('prediction = 1 AND label = prediction').count()
FN = prediction.filter('prediction = 0 AND label <> prediction').count()
FP = prediction.filter('prediction = 1 AND label <> prediction').count()
print(TN,TP,FN,FP)

2670 2931 1830 1990


In [27]:
from pyspark.ml.evaluation import MulticlassClassificationEvaluator, BinaryClassificationEvaluator

# Calculate precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)
print('precision = {:.2f}\nrecall    = {:.2f}'.format(precision, recall))

# Find weighted precision
multi_evaluator = MulticlassClassificationEvaluator()
weighted_precision = multi_evaluator.evaluate(prediction, {multi_evaluator.metricName: "weightedPrecision"})

# Find AUC
binary_evaluator = BinaryClassificationEvaluator()
auc = binary_evaluator.evaluate(prediction, {binary_evaluator.metricName: 'areaUnderROC'})

precision = 0.60
recall    = 0.62


In [28]:
auc

0.6271774963423306

## Turning Text into Tables
1. Turning Text into Tables

It's said that 80% of Machine Learning is data preparation. As we'll see in this lesson, this is particularly true for text data. Before you can use Machine Learning algorithms you need to take unstructured text data and create structure, ultimately transforming the data into a table.

2. One record per document

We start with a collection of documents. These documents might be anything from a short snippet of text, like an SMS or email, to a lengthy report or book. Each document will become a record in the table.

3. One document, many columns

The text in each document will be mapped to columns in the table. First the text is split into words or tokens. You then remove short or common words that do not convey too much information. The table will then indicate the number of times that each of the remaining words occurred in the text. This table is also known as a "term-document matrix". There are some nuances to the process, but that's the central idea.

4. A selection of children's books

Suppose that your documents are the names of children's books. The raw data might look like this. Your job will be to transform these data into a table with one row per document and a column for each of the words.

5. Removing punctuation

You're interested in words, not punctuation. You'll use regular expressions (or REGEX), a mini-language for pattern matching, to remove the punctuation symbols. Regular expressions is another big topic and outside of the scope of this course, but basically you are giving a list of symbols or text pattern to match. The hyphen is escaped by the backslashes because it has another meaning in the context of regular expressions. By escaping it you tell Spark to interpret the hyphen literally. You need to specify a column name, books.text, a pattern to be matched (stored in the variable REGEX), and the replacement text, which is simply a space. You now have some double spaces but you can use REGEX to clean those up too.

6. Text to tokens

Next you split the text into words or tokens. You create a tokenizer object, giving it the name of the input column containing the text and the output column which will contain the tokens. The tokenizer is then applied to the text using the transform() method. In the results you see a new column in which each document has been transformed into a list of words. As a side effect the words have all been reduced to lower case.

7. What are stop words?

Some words occur frequently in all of the documents. These common or "stop" words convey very little information, so you will also remove them using an instance of the StopWordsRemover class. This contains a list of stop words which can be customized if necessary.

8. Removing stop words

Since you didn't give the input and output column names earlier, you specify them now and then apply the transform method. You could also have given these names when you created the remover.

9. Feature hashing

Your documents might contain a large variety of words, so in principle our table could end up with an enormous number of columns, many of which would be only sparsely populated. It would also be handy to convert the words into numbers. Enter the hashing trick, which in simple terms converts words into numbers. You create an instance of the HashingTF class, providing the names of the input and output columns. You also give the number of features, which is effectively the largest number that will be produced by the hashing trick. This needs to be sufficiently big to capture the diversity in the words. The output in the hash column is presented in sparse format, which we will talk about more later on. For the moment though it's enough to note that there are two lists. The first list contains the hashed values and the second list indicates how many times each of those values occurs. For example, in the first document the word "long" has a hash of 8 and occurs twice. Similarly, the word "five" has a hash of 6 and occurs once in each of the last two documents.

10. Dealing with common words

The final step is to account for some words occurring frequently across many documents. If a word appears in many documents then it's probably going to be less useful for building a classifier. We want to weight the number of counts for a word in a particular document against how frequently that word occurs across all documents. To do this you reduce the effective count for more common words, giving what is known as the "inverse document frequency". Inverse document frequency is generated by the IDF class, which is first fit to the hashed data and then used to generate weighted counts. The word "five", for example, occurs in multiple documents, so its effective frequency is reduced. Conversely, the word "long" only occurs in one document, so its effective frequency is increased.

11. Text ready for Machine Learning!

The inverse document frequencies are precisely what we need for building a Machine Learning model. Let's do that with the SMS data.

### Punctuation, numbers and tokens
At the end of the previous chapter you loaded a dataset of SMS messages which had been labeled as either `"spam"` (label `1`) or `"ham"` (label `0`). You're now going to use those data to build a classifier model.

But first you'll need to prepare the SMS messages as follows:

- remove punctuation and numbers
- tokenize (split into individual words)
- remove stop words
- apply the hashing trick
- convert to `TF-IDF` representation.
In this exercise you'll remove punctuation and numbers, then tokenize the messages.

The SMS data are available as `sms`.

**Instructions**

- Import the function to replace regular expressions and the feature to tokenize.
- Replace all punctuation characters from the text column with a space. Do the same for all numbers in the text column.
- Split the text column into tokens. Name the output column words.

In [29]:
sms.show(5,truncate = 50)

+---+--------------------------------------------------+-----+
| id|                                              text|label|
+---+--------------------------------------------------+-----+
|  1|                 Sorry, I'll call later in meeting|    0|
|  2|                    Dont worry. I guess he's busy.|    0|
|  3|                 Call FREEPHONE 0800 542 0578 now!|    1|
|  4|       Win a 1000 cash prize or a prize worth 5000|    1|
|  5|Go until jurong point, crazy.. Available only i...|    0|
+---+--------------------------------------------------+-----+
only showing top 5 rows



In [30]:
# Import the necessary functions
from pyspark.sql.functions import regexp_replace
from pyspark.ml.feature import Tokenizer

# Remove punctuation (REGEX provided) and numbers
wrangled = sms.withColumn('text', regexp_replace(sms.text, '[_():;,.!?\\-]', ' '))
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, '[0-9]', ' '))

# Merge multiple spaces
wrangled = wrangled.withColumn('text', regexp_replace(wrangled.text, ' +', ' '))

# Split the text into words
wrangled = Tokenizer(inputCol='text', outputCol='words').transform(wrangled)

wrangled.show(4, truncate=False)

+---+----------------------------------+-----+------------------------------------------+
|id |text                              |label|words                                     |
+---+----------------------------------+-----+------------------------------------------+
|1  |Sorry I'll call later in meeting  |0    |[sorry, i'll, call, later, in, meeting]   |
|2  |Dont worry I guess he's busy      |0    |[dont, worry, i, guess, he's, busy]       |
|3  |Call FREEPHONE now                |1    |[call, freephone, now]                    |
|4  |Win a cash prize or a prize worth |1    |[win, a, cash, prize, or, a, prize, worth]|
+---+----------------------------------+-----+------------------------------------------+
only showing top 4 rows



### Stop words and hashing
The next steps will be to remove stop words and then apply the hashing trick, converting the results into a TF-IDF.

A quick reminder about these concepts:

- The hashing trick provides a fast and space-efficient way to map a very large (possibly infinite) set of items (in this case, all words contained in the SMS messages) onto a smaller, finite number of values.
- The TF-IDF matrix reflects how important a word is to each document. It takes into account both the frequency of the word within each document but also the frequency of the word across all of the documents in the collection.
The tokenized SMS data are stored in `sms` in a column named `words`. You've cleaned up the handling of spaces in the data so that the tokenized text is neater.

**Instructions**

- Import the `StopWordsRemover`, `HashingTF` and `IDF` classes.
- Create a `StopWordsRemover` object (input column `words`, output column `terms`). Apply to `sms`.
- Create a `HashingTF` object (input results from previous step, output column `hash`). Apply to `wrangled`.
- Create an `IDF` object (input results from previous step, output column `features`). Apply to `wrangled`.

In [31]:
sms_bsckup = sms

In [32]:
sms = wrangled.select('id', 'words','label')

In [33]:
from pyspark.ml.feature import StopWordsRemover, HashingTF, IDF

# Remove stop words.
wrangled = StopWordsRemover(inputCol='words', outputCol='terms')\
      .transform(sms)

# Apply the hashing trick
wrangled = HashingTF(inputCol = 'terms', outputCol = 'hash', numFeatures=1024)\
      .transform(wrangled)

# Convert hashed symbols to TF-IDF
tf_idf = IDF(inputCol = 'hash', outputCol = 'features')\
      .fit(wrangled).transform(wrangled)
      
tf_idf.select('terms', 'features').show(4, truncate=False)

+--------------------------------+----------------------------------------------------------------------------------------------------+
|terms                           |features                                                                                            |
+--------------------------------+----------------------------------------------------------------------------------------------------+
|[sorry, call, later, meeting]   |(1024,[138,384,577,996],[2.273418200008753,3.6288353225642043,3.5890949939146903,4.104259019279279])|
|[dont, worry, guess, busy]      |(1024,[215,233,276,329],[3.9913186080986836,3.3790235241678332,4.734227298217693,4.58299632849377]) |
|[call, freephone]               |(1024,[133,138],[5.367951058306837,2.273418200008753])                                              |
|[win, cash, prize, prize, worth]|(1024,[31,47,62,389],[3.6632029660684124,4.754846585420428,4.072170704727778,7.064594791043114])    |
+--------------------------------+--------------

### Training a spam classifier
The SMS data have now been prepared for building a classifier. Specifically, this is what you have done:

- removed numbers and punctuation
- split the messages into words (or "tokens")
- removed stop words
- applied the hashing trick and
- converted to a TF-IDF representation.
Next you'll need to split the TF-IDF data into training and testing sets. Then you'll use the training data to fit a Logistic Regression model and finally evaluate the performance of that model on the testing data.

The data are stored in `sms` and `LogisticRegression` has been imported for you.

**Instructions**

- Split the data into training and testing sets in a 4:1 ratio. Set the random number seed to `13` to ensure repeatability.
- Create a `LogisticRegression` object and fit it to the training data.
- Generate predictions on the testing data.
- Use the predictions to form a confusion matrix.

In [34]:
sms_backup2 = sms

In [35]:
sms = tf_idf
sms.show(5)

+---+--------------------+-----+--------------------+--------------------+--------------------+
| id|               words|label|               terms|                hash|            features|
+---+--------------------+-----+--------------------+--------------------+--------------------+
|  1|[sorry, i'll, cal...|    0|[sorry, call, lat...|(1024,[138,384,57...|(1024,[138,384,57...|
|  2|[dont, worry, i, ...|    0|[dont, worry, gue...|(1024,[215,233,27...|(1024,[215,233,27...|
|  3|[call, freephone,...|    1|   [call, freephone]|(1024,[133,138],[...|(1024,[133,138],[...|
|  4|[win, a, cash, pr...|    1|[win, cash, prize...|(1024,[31,47,62,3...|(1024,[31,47,62,3...|
|  5|[go, until, juron...|    0|[go, jurong, poin...|(1024,[12,171,191...|(1024,[12,171,191...|
+---+--------------------+-----+--------------------+--------------------+--------------------+
only showing top 5 rows



In [36]:
# Split the data into training and testing sets
sms_train, sms_test = sms.randomSplit([0.8,0.2], seed = 13)

# Fit a Logistic Regression model to the training data
logistic = LogisticRegression(regParam=0.2).fit(sms_train)

# Make predictions on the testing data
prediction = logistic.transform(sms_test)

# Create a confusion matrix, comparing predictions to known labels
prediction.groupBy('label', 'prediction').count().show()

+-----+----------+-----+
|label|prediction|count|
+-----+----------+-----+
|    1|       0.0|   41|
|    0|       0.0|  948|
|    1|       1.0|  105|
|    0|       1.0|    2|
+-----+----------+-----+



# Regression

## One-Hot Encoding
1. One-Hot Encoding

In the last chapter you saw how to use categorical variables in a model by simply converting them to indexed numerical values. In general this is not sufficient for a regression model. Let's see why.

2. The problem with indexed values

In the cars data the type column is categorical, with six levels: 'Midsize', 'Small', 'Compact', 'Sporty', 'Large' and 'Van'. Here you can see the number of times that each of those levels occurrs in the data. You used a string indexer to assign a numerical index to each level. However, there's a problem with the index: the numbers don't have any objective meaning. The index for 'Sporty' is 3. Does it make sense to do arithmetic on that index? No. For example, it wouldn't be meaningful to add the index for 'Sporty' to the index for 'Compact'. Nor would it be valid to compare those indexes and say that 'Sporty' is larger or smaller than 'Compact'. However, a regression model works by doing precisely this: arithmetic on predictor variables. You need to convert the index values into a format in which you can perform meaningful mathematical operations.

3. Dummy variables

The first step is to create a column for each of the levels. Effectively you then place a check in the column corresponding to the value in each row. So, for example, a record with a type of 'Sporty' would have a check in the 'Sporty' column. These new columns are known as 'dummy variables'.

4. Dummy variables: binary encoding

However, rather than having checks in the dummy variable columns it makes more sense to use binary values, where a one indicates the presence of the corresponding level. It might occur to you that the volume of data has exploded. You've gone from a single column of categorical values to six binary encoded dummy variables. If there were more levels then you'd have even more columns. This could get out of hand. However, the majority of the cells in the new columns contain zeros. The non-zero values, which actually encode the information, are relatively infrequent. This effect becomes even more pronounced if there are more levels. You can exploit this by converting the data into a sparse format.

5. Dummy variables: sparse representation
02:37 - 02:49
Rather than recording the individual values, the sparse representation simply records the column numbers and value for the non-zero values.

6. Dummy variables: redundant column

You can take this one step further. Since the categorical levels are mutually exclusive you can drop one of the columns. If type is not 'Midsize', 'Small', 'Compact', 'Sporty' or 'Large' then it must be 'Van'. The process of creating dummy variables is called 'One-Hot Encoding' because only one of the columns created is ever active or 'hot'. Let's see how this is done in Spark.

7. One-hot encoding

As you might expect, there's a class for doing one-hot encoding. Import the OneHotEncoder class from the feature sub-module. When instantiating the class you need to specify the names of the input and output columns. For car type the input column is the index we defined earlier. Choose 'type_dummy' as the output column name. Note that these arguments are given as lists, so it's possible to specify multiple columns if necessary. Next fit the encoder to the data. Check how many category levels have been identified: six as expected.

8. One-hot encoding

Now that the encoder is set up it can be applied to the data by calling the transform() method. Let's take a look at the results. There's now a type_dummy column which captures the dummy variables. As mentioned earlier, the final level is treated differently. No column is assigned to type Van because if a vehicle isn't one of the other types then it must be a Van. To have a separate dummy variable for Van would be redundant. The sparse format used to represent dummy variables looks a little complicated. Let's take a moment to dig into dense versus sparse formats.

9. Dense versus sparse

Suppose that you want to store a vector which consists mostly of zeros. You could store it as a dense vector, in which each of the elements of the vector is stored explicitly. This is wasteful though because most of those elements are zeros. A sparse representation is a much better alternative. To create a sparse vector you need to specify the size of the vector (in this case, eight), the positions which are non-zero (in this case, positions zero and five, noting that we start counting at zero) and the values for each of those positions, one and seven. Sparse representation is essential for effective one-hot encoding on large data sets.

10. One-Hot Encode categoricals

Let's try out one-hot encoding on the flights data.

### Encoding flight origin
The org column in the flights data is a categorical variable giving the airport from which a flight departs.

- ORD — O'Hare International Airport (Chicago)
- SFO — San Francisco International Airport
- JFK — John F Kennedy International Airport (New York)
- LGA — La Guardia Airport (New York)
- SMF — Sacramento
- SJC — San Jose
- OGG — Kahului (Hawaii)
Obviously this is only a small subset of airports. Nevertheless, since this is a categorical variable, it needs to be one-hot encoded before it can be used in a regression model.

The data are in a variable called `flights`. You have already used a string indexer to create a column of indexed values corresponding to the strings in `org`.

You might find it useful to revise the slides from the lessons in the Slides panel next to the IPython Shell.

**Instructions**

- Import the one-hot encoder class.
- Create a one-hot encoder instance, naming the input column `org_idx` and the output column `org_dummy`.
- Apply the one-hot encoder to the flights data.
- Generate a summary of the mapping from categorical values to binary encoded dummy variables. Include only unique values and order by `org_idx`.

In [37]:
flights_backup5 = flights
flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 5 rows



In [38]:
#Steps seems to be done before
#flights_indexed = StringIndexer(inputCol = 'org', outputCol = 'org_idx').fit(flights).transform(flights)
#flights_indexed.show(5)

In [39]:
# Import the one hot encoder class
from pyspark.ml.feature import OneHotEncoder

# Create an instance of the one hot encoder
onehot = OneHotEncoder(inputCols=['org_idx'], outputCols=['org_dummy'])

# Apply the one hot encoder to the flights data
onehot = onehot.fit(flights)
flights_onehot = onehot.transform(flights)

# Check the results
flights_onehot.select('org', 'org_idx', 'org_dummy').distinct().orderBy('org_idx').show()

+---+-------+-------------+
|org|org_idx|    org_dummy|
+---+-------+-------------+
|ORD|    0.0|(7,[0],[1.0])|
|SFO|    1.0|(7,[1],[1.0])|
|JFK|    2.0|(7,[2],[1.0])|
|LGA|    3.0|(7,[3],[1.0])|
|SMF|    4.0|(7,[4],[1.0])|
|SJC|    5.0|(7,[5],[1.0])|
|TUS|    6.0|(7,[6],[1.0])|
|OGG|    7.0|    (7,[],[])|
+---+-------+-------------+



### Encoding shirt sizes
You have data for a consignment of t-shirts. The data includes the size of the shirt, which is given as either S, M, L or XL.

Here are the counts for the different sizes:


|size|count|
|----|-----|
|   S|    8|
|   M|   15|
|   L|   20|
|  XL|    7|

The sizes are first converted to an index using `StringIndexer` and then one-hot encoded using `OneHotEncoder`.

Which of the following is not true:

**Answer the question**

- S shirts get index 2.0 and are one-hot encoded as (3,[2],[1.0])

- M shirts get index 1.0 and are one-hot encoded as (3,[1],[1.0])

- L shirts get index 0.0 and are one-hot encoded as (3,[0],[1.0])

- **XL shirts get index 3.0 and are one-hot encoded as (3,[3],[1.0])**

## Regression
1. Regression

In the previous lesson you learned how to one-hot encode categorical features, which is essential for building regression models. In this lesson you'll find out how to build a regression model to predict numerical values.

2. Consumption versus mass: scatter

Returning to the cars data, suppose you wanted to predict fuel consumption using vehicle mass. A scatter plot is a good way to visualize the relationship between those two variables. Only a subset of the data are included in this plot, but it's clear that consumption increases with mass. However the relationship is not perfectly linear: there's scatter for individual points. A model should describe the average relationship of consumption to mass, without necessarily passing through individual points.

3. Consumption versus mass: fit

This line, for example, might describe the underlying trend in the data.

4. Consumption versus mass: alternative fits

But there are other lines which could equally well describe that trend. How do you choose the line which best describes the relationship?

5. Consumption versus mass: residuals

First we need to define the concept of residuals. The residual is the difference between the observed value and the corresponding modeled value. The residuals are indicated in the plot as the vertical lines between the data points and the model line. The best model would somehow make these residuals as small as possible.

6. Loss function

Out of all possible models, the best model is found by minimizing a loss function, which is an equation that describes how well the model fits the data. This is the equation for the mean squared error loss function. Let's quickly break it down.

7. Loss function: Observed values

You've got the observed values, y_i, …

8. Loss function: Model values

and the modeled values, \hat{y}_i. The difference between these is the residual. The residuals are squared and then summed together…

9. Loss function: Mean

before finally dividing through by the number of data points to give the mean or average. By minimizing the loss function you are effectively minimizing the average residual or the average distance between the observed and modeled values. If this looks a little complicated, don't worry: Spark will do all of the maths for you.

10. Assemble predictors

Let's build a regression model to predict fuel consumption using three predictors: mass, number of cylinders and vehicle type, where the last is a categorical which we've already one-hot encoded. As before the first step towards building a model is to take our predictors and assemble them into a single column called 'features'. The data are then randomly split into training and testing sets.

11. Build regression model

The model is created using the LinearRegression class which is imported from the regression module. By default this class expects to find the target data in a column called "label". Since you are aiming to predict the "consumption" column you need to explicitly specify the name of the label column when creating a regression object. Next train the model on the training data using the fit() method. The trained model can then be used to making predictions on the testing data using the transform() method.

12. Examine predictions

Comparing the predicted values to the known values from the testing data you'll see that there is reasonable agreement. It's hard to tell from a table though. A plot gives a clearer picture. The dashed diagonal lie represents perfect prediction. Most of the points lie close to this line, which is good.

13. Calculate RMSE

It's useful to have a single number which summarizes the performance of a model. For classifiers there are a variety of such metrics. The Root Mean Squared Error is often used for regression models. It's the square root of the Mean Squared Error, which you've already encountered, and corresponds to the standard deviation of the residuals. The metrics for a classifier, like accuracy, precision and recall, are measured on an absolute scale where it's possible to immediately identify values that are "good" or "bad". Values of RMSE are relative to the scale of the value that you're aiming to predict, so interpretation is a little more challenging. A smaller RMSE, however, always indicates better predictions.

14. Consumption versus mass: intercept

Let's examine the model. The intercept is the value predicted by the model when all predictors are zero. On the plot this is the point where the model line intersects the vertical dashed line.

15. Examine intercept

You can find this value for the model using the intercept attribute. This is the predicted fuel consumption when both mass and number of cylinders are zero and the vehicle type is 'Van'. Of course, this is an entirely hypothetical scenario: no vehicle could have zero mass!

16. Consumption versus mass: slope

There's a slope associated with each of the predictors too, which represents how rapidly the model changes when that predictor changes.

17. Examine Coefficients

The coefficients attribute gives you access to those values. There's a coefficient for each of the predictors. The coefficients for mass and number of cylinders are positive, indicating that heavier cars with more cylinders consume more fuel. These coefficients also represent the rate of change for the corresponding predictor. For example, the coefficient for mass indicates the change in fuel consumption when mass increases by one unit. Remember that there's no dummy variable for Van? The coefficients for the type dummy variables are relative to Vans. These coefficients should also be interpreted with care: if you are going to compare the values for different vehicle types then this needs to be done for fixed mass and number of cylinders. Since all of the type dummy coefficients are negative, the model indicates that, for a specific mass and number of cylinders, all other vehicle types consume less fuel than a Van. Large vehicles have the most negative coefficient, so it's possible to say that, for a specific mass and number of cylinders, Large vehicles are the most fuel efficient.

18. Regression for numeric predictions

You've covered a lot of ground in this lesson. Let's apply what you've learned to the flights data.

### Flight duration model: Just distance
In this exercise you'll build a regression model to predict flight duration (the duration column).

For the moment you'll keep the model simple, including only the distance of the flight (the `km` column) as a predictor.

The data are in `flights`. The first few records are displayed in the terminal. These data have also been split into training and testing sets and are available as `flights_train` and `flights_test`.

**Instructions**

- Create a linear regression object. Specify the name of the label column. Fit it to the training data.
- Make predictions on the testing data.
- Create a regression evaluator object and use it to evaluate `RMSE` on the testing data.

In [40]:
# No need to backup, last time we created a new variable
assembler = VectorAssembler(inputCols = ['km'], outputCol = 'features')
flights_train, flights_test = assembler.transform(flights.select('km', 'duration')).randomSplit([0.8,0.2], seed=42)
flights_train.show(5)

+-----+--------+--------+
|   km|duration|features|
+-----+--------+--------+
|108.0|      43| [108.0]|
|108.0|      43| [108.0]|
|108.0|      44| [108.0]|
|108.0|      44| [108.0]|
|108.0|      44| [108.0]|
+-----+--------+--------+
only showing top 5 rows



In [41]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol = 'duration').fit(flights_train)

# Create predictions for the testing data and take a look at the predictions
predictions = regression.transform(flights_test)
predictions.select('duration', 'prediction').show(5, False)

# Calculate the RMSE
RegressionEvaluator(labelCol = 'duration').evaluate(predictions)

+--------+-----------------+
|duration|prediction       |
+--------+-----------------+
|44      |52.32045015618229|
|44      |52.32045015618229|
|44      |52.32045015618229|
|46      |52.32045015618229|
|46      |52.32045015618229|
+--------+-----------------+
only showing top 5 rows



16.881752082844525

### Interpreting the coefficients
The linear regression model for flight duration as a function of distance takes the form:

$ \text{duration} = \alpha + \beta \times \text{distance} $

where

- $\alpha$ intercept (component of duration which does not depend on distance) and
- $\beta$ coefficient (rate at which duration increases as a function of distance; also called the slope).
By looking at the coefficients of your model you will be able to infer

how much of the average flight duration is actually spent on the ground and
what the average speed is during a flight.
The linear regression model is available as `regression`.

**Instructions**

- What's the intercept?
- What are the coefficients? This is a vector.
- Extract the element from the vector which corresponds to the slope for distance.
- Find the average speed in km per hour.

In [42]:
# Intercept (average minutes on ground)
inter = regression.intercept
print(inter)

# Coefficients
coefs = regression.coefficients
print(coefs)

# Average minutes per km
minutes_per_km = regression.coefficients[0]
print(minutes_per_km)

# Average speed in km per hour - Divide the number of minutes in an hour by the average minutes per kilometer?
avg_speed = 60 / minutes_per_km
print(avg_speed)

44.147190398097976
[0.07567833109337331]
0.07567833109337331
792.8293229137269


### Flight duration model: Adding origin airport
Some airports are busier than others. Some airports are bigger than others too. Flights departing from large or busy airports are likely to spend more time taxiing or waiting for their takeoff slot. So it stands to reason that the duration of a flight might depend not only on the distance being covered but also on the airport from which the flight departs.

You are going to make the regression model a little more sophisticated by including the departure airport as a predictor.

These data have been split into training and testing sets and are available as `flights_train` and `flights_test`. The origin airport, stored in the `org` column, has been indexed into `org_idx`, which in turn has been one-hot encoded into `org_dummy`. The first few records are displayed in the terminal.

**Instructions**

- Fit a linear regression model to the training data.
- Make predictions for the testing data.
- Calculate the RMSE for predictions on the testing data.

In [43]:
flights_backup6 = flights

In [44]:
assembler = VectorAssembler(inputCols = ['km', 'org_dummy'], outputCol = 'features')
flights_train, flights_test = assembler.transform(flights_onehot.select('km','org_dummy', 'duration')).randomSplit([0.8,0.2], seed = 42)
flights_train.show(5)

+-----+-------------+--------+--------------------+
|   km|    org_dummy|duration|            features|
+-----+-------------+--------+--------------------+
|108.0|(7,[0],[1.0])|      43|(8,[0,1],[108.0,1...|
|108.0|(7,[0],[1.0])|      43|(8,[0,1],[108.0,1...|
|108.0|(7,[0],[1.0])|      44|(8,[0,1],[108.0,1...|
|108.0|(7,[0],[1.0])|      44|(8,[0,1],[108.0,1...|
|108.0|(7,[0],[1.0])|      44|(8,[0,1],[108.0,1...|
+-----+-------------+--------+--------------------+
only showing top 5 rows



In [45]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Create a regression object and train on training data
regression = LinearRegression(labelCol = 'duration').fit(flights_train)

# Create predictions for the testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
RegressionEvaluator(labelCol = 'duration').evaluate(predictions)

10.965027847682057

### Interpreting coefficients
Remember that origin airport, `org`, has eight possible values (`ORD`, `SFO`, `JFK`, `LGA`, `SMF`, `SJC`, `TUS` and `OGG`) which have been one-hot encoded to seven dummy variables in `org_dummy`.

The values for `km` and `org_dummy` have been assembled into `features`, which has eight columns with sparse representation. Column indices in `features` are as follows:

- 0 — `km`
- 1 — `ORD`
- 2 — `SFO`
- 3 — `JFK`
- 4 — `LGA`
- 5 — `SMF`
- 6 — `SJC` and
- 7 — `TUS`.
Note that `OGG` does not appear in this list because it is the reference level for the origin airport category.

In this exercise, you'll be using the `intercept` and `coefficients` attributes to interpret the model.

The `coefficients` attribute is a list, where the first element indicates how flight duration changes with flight distance.

**Instructions**

- Find the average speed in km per hour. This will be different to the value that you got earlier because your model is now more sophisticated.
- What's the average time on the ground at `OGG`?
- What's the average time on the ground at `JFK`?
- What's the average time on the ground at `LGA`?

In [46]:
# Average speed in km per hour
avg_speed_hour = 60 / regression.coefficients[0]
print(avg_speed_hour)

# Average minutes on ground at OGG
inter = regression.intercept
print(inter)

# Average minutes on ground at JFK
avg_ground_jfk = inter + regression.coefficients[3]
print(avg_ground_jfk)

# Average minutes on ground at LGA
avg_ground_lga = inter + regression.coefficients[4]
print(avg_ground_lga)

807.7693956642808
15.548047084068516
68.22354950340909
62.43652918727568


## Bucketing & Engineering
1. Bucketing & Engineering

The largest improvements in Machine Learning model performance are often achieved by carefully manipulating features. In this lesson you'll be learning about a few approaches to doing this.

2. Bucketing

Let's start with bucketing. It's often convenient to convert a continuous variable, like age or height, into discrete values. This can be done by assigning values to buckets or bins with well defined boundaries. The buckets might have uniform or variable width.

3. Bucketing heights

Let's make this more concrete by thinking about observations of people's heights. If you plot the heights on a histogram then it seems reasonable…

4. Bucketing heights

… to divide the heights up into ranges. To each of these ranges…

5. Bucketing heights

… you assign a label. Then you create a new column in the data…

6. Bucketing heights

… with the appropriate labels. The resulting categorical variable is often a more powerful predictor than the original continuous variable.

7. RPM histogram

Let's apply this to the cars data. Looking at the distribution of values for RPM you see that the majority lie in the range between 4500 and 6000. There are a few either below or above this range. This suggests that it would make sense to bucket these values according to those boundaries.

8. RPM buckets

You create a bucketizer object, specifying the bin boundaries as the "splits" argument and also providing the names of the input and output columns. You then apply this object to the data by calling the transform() method.

9. RPM buckets

The result has a new column with the discrete bucket values. The three buckets have been assigned index values zero, one and two, corresponding to the low, medium and high ranges for RPM.

10. One-hot encoded RPM buckets

As you saw earlier, before you can use these index values in a regression model, they first need to be one-hot encoded. The low and medium RPM ranges are mapped to distinct dummy variables, while the high range is the reference level and does not get a separate dummy variable.

11. Model with bucketed RPM

Let's look at the intercept and coefficients for a model which predicts fuel consumption based on bucketed RPM data. The intercept tells us what the fuel consumption is for the reference level, which is the high RPM bucket. To get the consumption for the low RPM bucket you add the first coefficient to the intercept. Similarly, to find the consumption for the medium RPM bucket you add the second coefficient to the intercept.

12. More feature engineering

There are many other approaches to engineering new features. It's common to apply arithmetic operations to one or more columns to create new features.

13. Mass & Height to BMI

Returning to the heights data. Suppose that we also had data for mass.

14. Mass & Height to BMI

Then it might be perfectly reasonable to engineer a new column for BMI. Potentially BMI might be a more powerful predictor than either height or mass in isolation.

15. Engineering density

Let's apply this idea to the cars data. You have columns for mass and length. Perhaps some combination of the two might be even more meaningful. You can create different forms of density by dividing the mass through by the first three powers of length. Since you only have the length of the vehicles but not their width or height, the length is being used as a proxy for these missing dimensions. In so doing you create three new predictors. The first density represents how mass changes with vehicle length. The second and third densities approximate how mass varies with the area and volume of the vehicle. Which of these will be meaningful for our model? Right now you don't know, you're just trying things out. Powerful new features are often discovered through trial and error. In the next lesson you'll learn about a technique for selecting only the relevant predictors in a regression model.

16. Let's engineer some features!

Right now though, let's apply what you've learned to the flights data.

### Bucketing departure time
Time of day data are a challenge with regression models. They are also a great candidate for bucketing.

In this lesson you will convert the flight departure times from numeric values between 0 (corresponding to 00:00) and 24 (corresponding to 24:00) to binned values. You'll then take those binned values and one-hot encode them.

**Instructions**

- Create a bucketizer object with bin boundaries at 0, 3, 6, …, 24 which correspond to times 0:00, 03:00, 06:00, …, 24:00. Specify input column as `depart` and output column as `depart_bucket`.
- Bucket the departure times in the flights data. Show the first five values for `depart` and `depart_bucket`.
- Create a one-hot encoder object. Specify output column as `depart_dummy`.
- Train the encoder on the data and then use it to convert the bucketed departure times to dummy variables. Show the first five values for depart, `depart_bucket` and `depart_dummy`.



In [47]:
from pyspark.ml.feature import Bucketizer, OneHotEncoder

# Create buckets at 3 hour intervals through the day
buckets = Bucketizer(splits=[0,3,6,9,12,15,18,21,24], inputCol = 'depart', outputCol = 'depart_bucket')

# Bucket the departure times
bucketed = buckets.transform(flights)
bucketed.select('depart', 'depart_bucket').show(5)

# Create a one-hot encoder
onehot = OneHotEncoder(inputCol = 'depart_bucket', outputCol = 'depart_dummy')

# One-hot encode the bucketed departure times
flights_onehot = onehot.fit(bucketed).transform(bucketed)
flights_onehot.select('depart', 'depart_bucket', 'depart_dummy').show(5)

+------+-------------+
|depart|depart_bucket|
+------+-------------+
| 16.33|          5.0|
|  6.17|          2.0|
| 10.33|          3.0|
|  7.98|          2.0|
| 10.83|          3.0|
+------+-------------+
only showing top 5 rows

+------+-------------+-------------+
|depart|depart_bucket| depart_dummy|
+------+-------------+-------------+
| 16.33|          5.0|(7,[5],[1.0])|
|  6.17|          2.0|(7,[2],[1.0])|
| 10.33|          3.0|(7,[3],[1.0])|
|  7.98|          2.0|(7,[2],[1.0])|
| 10.83|          3.0|(7,[3],[1.0])|
+------+-------------+-------------+
only showing top 5 rows



### Flight duration model: Adding departure time
In the previous exercise the departure time was bucketed and converted to dummy variables. Now you're going to include those dummy variables in a regression model for flight duration.

The data are in `flights`. The `km`, `org_dummy` and `depart_dummy` columns have been assembled into features, where `km` is index `0`, `org_dummy` runs from index `1` to `7` and `depart_dummy` from index 8 to 14.

The data have been split into training and testing sets and a linear regression model, `regression`, has been built on the training data. Predictions have been made on the testing data and are available as predictions.

**Instructions**

- Find the RMSE for predictions on the testing data.
- Find the average time spent on the ground for flights departing from `OGG` between 21:00 and 24:00.
- Find the average time spent on the ground for flights departing from `OGG` between 03:00 and 06:00.
- Find the average time spent on the ground for flights departing from `JFK` between 03:00 and 06:00.

Feature columns:

* 0 — km
* 1 — ORD
* 2 — SFO
* 3 — JFK
* 4 — LGA
* 5 — SJC
* 6 — SMF
* 7 — TUS
* 8 — 00:00 to 03:00
* 9 — 03:00 to 06:00
* 10 — 06:00 to 09:00
* 11 — 09:00 to 12:00
* 12 — 12:00 to 15:00
* 13 — 15:00 to 18:00
* 14 — 18:00 to 21:00

In [48]:
flights_onehot.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|depart_bucket| depart_dummy|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|          5.0|(7,[5],[1.0])|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|          3.0|(7,[3],[1.0])|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|          3.0|(7,[3],[1.0])|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+
only showing top 5 

In [49]:
onehot = OneHotEncoder(inputCol = 'org_idx', outputCol = 'org_dummy')
assembler = VectorAssembler(inputCols = ['km', 'org_dummy', 'depart_dummy'], outputCol = 'features')
flights = onehot.fit(flights_onehot).transform(flights_onehot)
flights_train, flights_test = assembler.transform(flights).select('km', 'org_dummy', 'depart_dummy', 'features', 'duration').randomSplit([0.8,0.2], seed = 42)
flights_train.show(5, truncate = 50)

+-----+-------------+-------------+-----------------------------+--------+
|   km|    org_dummy| depart_dummy|                     features|duration|
+-----+-------------+-------------+-----------------------------+--------+
|108.0|(7,[0],[1.0])|    (7,[],[])|       (15,[0,1],[108.0,1.0])|      50|
|108.0|(7,[0],[1.0])|    (7,[],[])|       (15,[0,1],[108.0,1.0])|      50|
|108.0|(7,[0],[1.0])|(7,[2],[1.0])|(15,[0,1,10],[108.0,1.0,1.0])|      44|
|108.0|(7,[0],[1.0])|(7,[2],[1.0])|(15,[0,1,10],[108.0,1.0,1.0])|      44|
|108.0|(7,[0],[1.0])|(7,[2],[1.0])|(15,[0,1,10],[108.0,1.0,1.0])|      44|
+-----+-------------+-------------+-----------------------------+--------+
only showing top 5 rows



In [50]:
regression = LinearRegression(labelCol = 'duration').fit(flights_train)
predictions = regression.transform(flights_test)

In [51]:
# Find the RMSE on testing data
from pyspark.ml.evaluation import RegressionEvaluator
rmse = RegressionEvaluator(labelCol = 'duration').evaluate(predictions)
print("The test RMSE is", rmse)

# Average minutes on ground at OGG for flights departing between 21:00 and 24:00
avg_eve_ogg = regression.intercept
print(avg_eve_ogg)

# Average minutes on ground at OGG for flights departing between 03:00 and 06:00
avg_night_ogg = regression.intercept + regression.coefficients[8]
print(avg_night_ogg)

# Average minutes on ground at JFK for flights departing between 03:00 and 06:00
avg_night_jfk = regression.intercept + regression.coefficients[3] + regression.coefficients[9]
print(avg_night_jfk)

The test RMSE is 10.799923937159477
10.257324622528868
-3.3037768914902106
63.31414109793149


## Regularization
1. Regularization

The regression models that you've built up until now have blindly included all of the provided features. Next you are going to learn about a more sophisticated model which effectively selects only the most useful features.

2. Features: Only a few

A linear regression model attempts to derive a coefficient for each feature in the data. The coefficients quantify the effect of the corresponding features. More features imply more coefficients. This works well when your dataset has a few columns and many rows. You need to derive a few coefficients and you have plenty of data.

3. Features: Too many

The converse situation, many columns and few rows, is much more challenging. Now you need to calculate values for numerous coefficients but you don't have much data to do it. Even if you do manage to derive values for all of those coefficients, your model will end up being very complicated and difficult to interpret. Ideally you want to create a parsimonious model: one that has just the minimum required number of predictors. It will be as simple as possible, yet still able to make robust predictions.

4. Features: Selected

The obvious solution is to simply select the "best" subset of columns. But how to choose that subset? There are a variety of approaches to this "feature selection" problem.

5. Loss function (revisited)

In this lesson we'll be exploring one such approach to feature selection known as "penalized regression". The basic idea is that the model is penalized, or punished, for having too many coefficients. Recall that the conventional regression algorithm chooses coefficients to minimize the loss function, which is average of the squared residuals. A good model will result in low MSE because its predictions will be close to the observed values.

6. Loss function with regularization

With penalized regression an additional "regularization" or "shrinkage" term is added to the loss function. Rather than depending on the data, this term is a function of the model coefficients.

7. Regularization term

There are two standard forms for the regularization term. Lasso regression uses a term which is proportional to the absolute value of the coefficients, while Ridge regression uses the square of the coefficients. In both cases this extra term in the loss function penalizes models with too many coefficients. There's a subtle distinction between Lasso and Ridge regression. Both will shrink the coefficients of unimportant predictors. However, whereas Ridge will result in those coefficients being close to zero, Lasso will actually force them to zero precisely. It's also possible to have a mix of Lasso and Ridge. The strength of the regularization is determined by a parameter which is generally denoted by the Greek symbol lambda. When lambda = 0 there is no regularization and when lambda is large regularization completely dominates. Ideally you want to choose a value for lambda between these two extremes!

8. Cars again

Let's make this more concrete by returning to the cars data. We've assembled the mass, cylinders and type columns along with the freshly engineered density columns. We've effectively got ten predictors available for the model. As usual we'll split these data into training and testing sets.

9. Cars: Linear regression
03:54 - 04:25
Let's start by fitting a standard linear regression model to the training data. You can then make predictions on the testing data and calculate the RMSE. When you look at the model coefficients you find that all predictors have been assigned non-zero values. This means that every predictor is contributing to the model. This is certainly possible, but it's unlikely that all of the features are actually important for predicting consumption.

10. Cars: Ridge regression

Now let's fit a Ridge Regression model to the same data. You get a Ridge Regression model by giving a value of zero for elasticNetParam. An arbitrary value of 0.1 has been chosen for the regularization strength. Later you'll learn a way to choose good values for this parameter based on the data. When you calculate the RMSE on the testing data you find that it has increased slightly, but not enough to cause concern. Looking at the coefficients you see that they are all smaller than the coefficients for the standard linear regression model. They have been "shrunk".

11. Cars: Lasso regression

Finally let's build a Lasso Regression model, by setting elasticNetParam to 1. Again you find that the testing RMSE has increased, but not by a significant degree. Turning to the coefficients though, you see that something important has happened: all but two of the coefficients are now zero. There are effectively only two predictors left in the model: the dummy variable for a small type car and the linear density. Lasso Regression has identified the most important predictors and set the coefficients for the rest to zero. This tells us that we can get a good model by simply knowing whether or not a car is 'small' and it's linear density. A simpler model with no significant loss in performance.

12. Regularization ? simple model

Let's try out regularization on our flight duration model

### Flight duration model: More features!
Let's add more features to our model. This will not necessarily result in a better model. Adding some features might improve the model. Adding other features might make it worse.

More features will always make the model more complicated and difficult to interpret.

These are the features you'll include in the next model:

- `km`
- `org` (origin airport, one-hot encoded, 8 levels)
- `depart` (departure time, binned in 3 hour intervals, one-hot encoded, 8 levels)
- `dow` (departure day of week, one-hot encoded, 7 levels) and
- `mon` (departure month, one-hot encoded, 12 levels).
These have been assembled into the features column, which is a sparse representation of 32 columns (remember one-hot encoding produces a number of columns which is one fewer than the number of levels).

The data are available as `flights`, randomly split into `flights_train` and `flights_test`.

This exercise is based on a small subset of the flights data.

**Instructions**

- Fit a linear regression model to the training data.
- Generate predictions for the testing data.
- Calculate the RMSE on the testing data.
- Look at the model coefficients. Are any of them zero?

In [52]:
flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+-------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|depart_bucket| depart_dummy|    org_dummy|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+-------------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|          5.0|(7,[5],[1.0])|(7,[0],[1.0])|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|(7,[1],[1.0])|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|          3.0|(7,[3],[1.0])|(7,[0],[1.0])|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|(7,[1],[1.0])|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|          3.0|(7,[3],[1.0])|(7,[0],[1.0])|
+---+---+---+---

In [53]:
dow_indexer = StringIndexer(inputCol = 'dow', outputCol = 'dow_idx')
dow_enc = OneHotEncoder(inputCol = 'dow_idx', outputCol = 'dow_dummy')
mon_indexer = StringIndexer(inputCol = 'mon', outputCol = 'mon_idx')
mon_enc = OneHotEncoder(inputCol = 'mon_idx', outputCol = 'mon_dummy')
flights_di = dow_indexer.fit(flights).transform(flights)
flights_de = dow_enc.fit(flights_di).transform(flights_di)
flights_mi = mon_indexer.fit(flights_de).transform(flights_de)
flights_me = mon_enc.fit(flights_mi).transform(flights_mi)
flights_me.show(5)
assemb = VectorAssembler(inputCols = ['km', 'org_dummy', 'depart_dummy', 'dow_dummy', 'mon_dummy'], outputCol = 'features')
flights_train, flights_test = assemb.transform(flights_me).select('features', 'duration').randomSplit([0.8, 0.2], seed = 42)
flights_train.show(5, truncate = 70)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+-------------+-------+-------------+-------+---------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|depart_bucket| depart_dummy|    org_dummy|dow_idx|    dow_dummy|mon_idx|      mon_dummy|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+-------------+-------+-------------+-------+---------------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|          5.0|(7,[5],[1.0])|(7,[0],[1.0])|    3.0|(6,[3],[1.0])|    2.0| (11,[2],[1.0])|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|(7,[1],[1.0])|    2.0|(6,[2],[1.0])|    3.0| (11,[3],[1.0])|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|          3.0|(7,[3],[1.0])|(7,[0],[1.0])|    1.0|(6,[1],[1.0])|   10.0|(11

In [54]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Fit linear regression model to training data
regression = LinearRegression(labelCol = 'duration').fit(flights_train)

# Make predictions on testing data
predictions = regression.transform(flights_test)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol = 'duration').evaluate(predictions)
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

The test RMSE is 10.574746538111988
[0.07439483222682225,27.83378904120845,20.663402810401653,51.90844432181796,46.15802944524765,15.428065698164207,17.958678934937222,17.751096954538646,-13.966840313924555,1.3614790076358905,4.113124317818998,7.026364390611397,4.651969172579217,8.896825045902881,8.717426904903839,0.07914809357208828,0.10699162263365765,0.29942494302441125,-0.0685932827822467,0.539004953229117,0.1506363578726905,-3.4468262069067883,-3.5863671211600536,-1.1770108756650566,-1.312471852247298,-1.5539893789442398,-3.690083530836403,0.9059503152070464,-3.3148944652132166,-2.7817574270519803,-3.2052505334180847,-2.1621402885887337]


### Flight duration model: Regularization!
In the previous exercise you added more predictors to the flight duration model. The model performed well on testing data, but with so many coefficients it was difficult to interpret.

In this exercise you'll use Lasso regression (regularized with a L1 penalty) to create a more parsimonious model. Many of the coefficients in the resulting model will be set to zero. This means that only a subset of the predictors actually contribute to the model. Despite the simpler model, it still produces a good RMSE on the testing data.

You'll use a specific value for the regularization strength. Later you'll learn how to find the best value using cross validation.

The data (same as previous exercise) are available as `flights`, randomly split into `flights_train` and `flights_test`.

There are two parameters for this model, `λ` (regParam) and `α` (elasticNetParam), where `α` determines the type of regularization and `λ` gives the strength of regularization.

**Instructions**

- Fit a linear regression model to the training data. Set the regularization strength to 1.
- Calculate the RMSE on the testing data.
- Look at the model coefficients.
- How many of the coefficients are equal to zero?

In [55]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

# Fit Lasso model (λ = 1, α = 1) to training data
regression = LinearRegression(labelCol = 'duration', regParam = 1, elasticNetParam=1).fit(flights_train)

# Calculate the RMSE on testing data
rmse = RegressionEvaluator(labelCol = 'duration').evaluate(regression.transform(flights_test))
print("The test RMSE is", rmse)

# Look at the model coefficients
coeffs = regression.coefficients
print(coeffs)

# Number of zero coefficients
zero_coeff = sum([beta for beta in regression.coefficients])
print("Number of coefficients equal to 0:", zero_coeff)

The test RMSE is 11.55850218375903
[0.07348295337481496,5.6617372847641265,0.0,28.75266895221526,21.868371084613983,-2.3612686725471783,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0974936889170492,1.0833419175529913,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0]
Number of coefficients equal to 0: 56.17582720889105


# Ensembles & Pipelines

## Pipeline
1. Pipeline

Welcome back! So far you've learned how to build classifier and regression models using Spark. In this chapter you'll learn how to make those models better. You'll start by taking a look at pipelines, which will seriously streamline your workflow. They will also help to ensure that training and testing data are treated consistently and that no leakage of information between these two sets takes place.

2. Leakage?

What do I mean by leakage? Most of the actions you've been using involve both a fit() and a transform() method. Those methods have been applied in a fairly relaxed way. But to get really robust results you need to be careful only to apply the fit() method to training data. Why? Because if a fit() method is applied to *any* of the testing data then the model will effectively have seen those data during the training phase, so the results of testing will no longer be objective. The transform() method, on the other hand, can be applied to both training and testing data since it does not result in any changes in the underlying model.

3. A leaky model

A figure should make this clearer. Leakage occurs whenever a fit() method is applied to testing data. Suppose that you fit a model using both the training and testing data. The model would then already have *seen* the testing data, so using those data to test the model would not be fair: of course the model will perform well on data which has been used for training! This sounds obvious, but care must be taken not to fall into this trap. Remember that there are normally multiple stages in building a model and if the fit() method in *any* of those stages is applied to the testing data then the model is compromised.

4. A watertight model

However, if you are careful to only apply fit() to the training data then your model will be in good shape. When it comes to testing it will not have seen *any* of the testing data and the test results will be completely objective. Luckily a pipeline will make it easier to avoid leakage because it simplifies the training and testing process.

5. Pipeline

A pipeline is a mechanism to combine a series of steps. Rather than applying each of the steps individually, they are all grouped together and applied as a single unit.

6. Cars model: Steps

Let's return to our cars regression model. Recall that there were a number of steps involved: - using a string indexer to convert the type column to indexed values; - applying a one-hot encoder to convert those indexed values into dummy variables; then - assembling a set of predictors into a single features column; and finally - building a regression model.

7. Cars model: Applying steps

Let's map out the process of applying those steps. - First you fit the indexer to the training data. Then you call the transform() method on the training data to add the indexed column. - Then you call the transform() method on the testing data to add the indexed column there too. Note that the testing data was not used to fit the indexer. Next you do the same things for the one-hot encoder, fitting to the training data and then using the fitted encoder to update the training and testing data sets. The assembler is next. In this case there is no fit() method, so you simply apply the transform() method to the training and testing data. Finally the data are ready. You fit the regression model to the training data and then use the model to make predictions on the testing data. Throughout the process you've been careful to keep the testing data out of the training process. But this is hard work and it's easy enough to slip up.

8. Cars model: Pipeline

A pipeline makes training and testing a complicated model a lot easier. The Pipeline class lives in the ml sub-module. You create a pipeline by specifying a sequence of stages, where each stage corresponds to a step in the model building process. The stages are executed in order. Now, rather than calling the fit() and transform() methods for each stage, you simply call the fit() method for the pipeline on the training data. Each of the stages in the pipeline is then automatically applied to the training data in turn. This will systematically apply the fit() and transform() methods for each stage in the pipeline. The trained pipeline can then be used to make predictions on the testing data by calling its transform() method. The pipeline transform() method will only call the transform() method for each of the stages in the pipeline. Isn't that simple?

9. Cars model: Stages

You can access the stages in the pipeline by using the .stages attribute, which is a list. You pick out individual stages by indexing into the list. For example, to access the regression component of the pipeline you'd use an index of 3. Having access to that component makes it possible to get the intercept and coefficients for the trained LinearRegression model.

10. Pipelines streamline workflow!

Pipelines make your code easier to read and maintain. Let's try them out with our flights model.

### Flight duration model: Pipeline stages
You're going to create the stages for the flights duration model pipeline. You will use these in the next exercise to build a pipeline and to create a regression model.

The `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `LinearRegression` classes are already imported.

**Instructions**

- Create an indexer to convert the `'org'` column into an indexed column called `'org_idx'`.
- Create a one-hot encoder to convert the `'org_idx'` and `'dow'` columns into dummy variable columns called `'org_dummy'` and `'dow_dummy'`.
- Create an assembler which will combine the `'km'` column with the two dummy variable columns. The output column should be called `'features'`.
- Create a linear regression object to predict flight duration.

You might find it useful to revisit the slides from the lessons in the Slides panel next to the IPython Shell.

In [56]:
# Convert categorical strings to index values
indexer = StringIndexer(inputCol = 'org', outputCol = 'org_idx')

# One-hot encode index values
onehot = OneHotEncoder(
    inputCols=['org_idx', 'dow'],
    outputCols=['org_dummy', 'dow_dummy']
)

# Assemble predictors into a single column
assembler = VectorAssembler(inputCols=['km','org_dummy','dow_dummy'], outputCol='features')

# A linear regression object
regression = LinearRegression(labelCol='duration')

### Flight duration model: Pipeline model
You're now ready to put those stages together in a pipeline.

You'll construct the pipeline and then train the pipeline on the training data. This will apply each of the individual stages in the pipeline to the training data in turn. None of the stages will be exposed to the testing data at all: there will be no leakage!

Once the entire pipeline has been trained it will then be used to make predictions on the testing data.

The data are available as `flights`, which has been randomly split into `flights_train` and `flights_test`.

**Instructions**

- Import the class for creating a pipeline.
- Create a pipeline object and specify the `indexer`, `onehot`, `assembler` and `regression` stages, in this order.
- Train the pipeline on the training data.
- Make predictions on the testing data.

In [57]:
flights.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+-------------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|depart_bucket| depart_dummy|    org_dummy|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+-------------+-------------+-------------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|          5.0|(7,[5],[1.0])|(7,[0],[1.0])|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|(7,[1],[1.0])|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|          3.0|(7,[3],[1.0])|(7,[0],[1.0])|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|          2.0|(7,[2],[1.0])|(7,[1],[1.0])|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|          3.0|(7,[3],[1.0])|(7,[0],[1.0])|
+---+---+---+---

In [58]:
flights_backup3.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|carrier_idx|org_idx|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
|  0| 22|  2|     UA|ORD| 16.33|      82|   30| 509.0|    1|        0.0|    0.0|
|  2| 20|  4|     UA|SFO|  6.17|      82|   -8| 542.0|    0|        0.0|    1.0|
|  9| 13|  1|     AA|ORD| 10.33|     195|   -5|1989.0|    0|        1.0|    0.0|
|  5|  2|  1|     UA|SFO|  7.98|     102|    2| 885.0|    0|        0.0|    1.0|
|  7|  2|  6|     AA|ORD| 10.83|     135|   54|1180.0|    1|        1.0|    0.0|
+---+---+---+-------+---+------+--------+-----+------+-----+-----------+-------+
only showing top 5 rows



In [59]:
flights_test, flights_train = flights_backup3.drop('carrier_idx').drop('org_idx').randomSplit([0.8,0.2], seed = 42)
flights_train.show(5)

+---+---+---+-------+---+------+--------+-----+------+-----+
|mon|dom|dow|carrier|org|depart|duration|delay|    km|label|
+---+---+---+-------+---+------+--------+-----+------+-----+
|  0|  1|  2|     AA|JFK|  12.0|     370|   11|3983.0|    0|
|  0|  1|  2|     AA|LGA|  9.92|     170|   -9|1180.0|    0|
|  0|  1|  2|     AA|LGA| 20.42|     185|   31|1765.0|    1|
|  0|  1|  2|     AA|ORD|  9.08|     560|   39|6828.0|    1|
|  0|  1|  2|     AA|ORD| 14.08|     270|   20|3335.0|    1|
+---+---+---+-------+---+------+--------+-----+------+-----+
only showing top 5 rows



In [60]:
# Import class for creating a pipeline
from pyspark.ml import Pipeline

# Construct a pipeline
pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])

# Train the pipeline on the training data
pipeline = pipeline.fit(flights_train)

# Make predictions on the testing data
predictions = pipeline.transform(flights_test)

### SMS spam pipeline
You haven't looked at the SMS data for quite a while. Last time we did the following:

- split the text into tokens
- removed stop words
- applied the hashing trick
- converted the data from counts to IDF and
- trained a logistic regression model.
Each of these steps was done independently. This seems like a great application for a pipeline!

The `Pipeline` and `LogisticRegression` classes have already been imported into the session, so you don't need to worry about that!

**Instructions**

- Create an object for splitting text into tokens.
- Create an object to remove stop words. Rather than explicitly giving the input column name, use the `getOutputCol()` method on the previous object.
- Create objects for applying the hashing trick and transforming the data into a TF-IDF. Use the `getOutputCol()` method again.
- Create a pipeline which wraps all of the above steps as well as an object to create a Logistic Regression model.

In [61]:
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

# Break text into tokens at non-word characters
tokenizer = Tokenizer(inputCol='text', outputCol='words')

# Remove stop words
remover = StopWordsRemover(inputCol=tokenizer.getOutputCol(), outputCol='terms')

# Apply the hashing trick and transform to TF-IDF
hasher = HashingTF(inputCol=remover.getOutputCol(), outputCol="hash")
idf = IDF(inputCol=hasher.getOutputCol(), outputCol="features")

# Create a logistic regression object and add everything to a pipeline
logistic = LogisticRegression()
pipeline = Pipeline(stages=[tokenizer, remover, hasher, idf, logistic])

## Cross-Validation
1. Cross-Validation

Up until now you've been testing models using a rather simple technique: randomly splitting the data into training and testing sets, training the model on the training data and then evaluating its performance on the testing set. There's one major drawback to this approach: you only get one estimate of the model performance. You would have a more robust idea of how well a model works if you were able to test it multiple times. This is precisely the idea behind cross-validation.

2. CV - complete data

You start out with the full set of data.

3. CV - train/test split

You still split these data into a training set and a testing set. Remember that before splitting it's important to first randomize the data so that the distributions in the training and testing data are similar.

4. CV - multiple folds

You then split the training data into a number of partitions or "folds". The number of folds normally factors into the name of the technique. For example, if you split into five folds then you'd talk about 5-fold cross-validation.

5. Fold upon fold - first fold

Once the training data have been split into folds you can start cross-validating. First keep aside the data in the first fold. Train a model on the remaining four folds. Then evaluate that model on the data from the first fold. This will give the first value for the evaluation metric.

6. Fold upon fold - second fold

Next you move onto the second fold, where the same process is repeated: data in the second fold are set aside for testing while the remaining four folds are used to train a model. That model is tested on the second fold data, yielding the second value for the evaluation metric.

7. Fold upon fold - other folds

You repeat the process for the remaining folds. Each of the folds is used in turn as testing data and you end up with as many values for the evaluation metric as there are folds. At this point you are in a position to calculate the average of the evaluation metric over all folds, which is a much more robust measure of model performance than a single value.

8. Cars revisited

Let's see how this works in practice. Remember the cars data? Of course you do. You're going to build a cross-validated regression model to predict consumption.

9. Estimator and evaluator

Here are the first two ingredients which you need to perform cross-validation: - an estimator, which builds the model and is often a pipeline; and - an evaluator, which quantifies how well a model works on testing data. We've seen both of these a few times already.

10. Grid and cross-validator

Now the final ingredients. You'll need two new classes, CrossValidator and ParamGridBuilder, both from the tuning sub-module. You'll create a parameter grid, which you'll leave empty for the moment, but will return to in detail during the next lesson. Finally you have everything required to create a cross-validator object: - an estimator, which is the linear regression model, - an empty grid of parameters for the estimator and - an evaluator which will calculate the RMSE. You can optionally specify the number of folds (which defaults to three) and a random number seed for repeatability.

11. Cross-validators need training too

The cross-validator has a fit() method which will apply the cross-validation procedure to the training data. You can then look at the average RMSE calculated across all of the folds. This is a more robust measure of model performance because it is based on multiple train/test splits. Note that the average metric is returned as a list. You'll see why in the next lesson.

12. Cross-validators act like models

The trained cross-validator object acts just like any other model. It has a transform method, which can be used to make predictions on new data. If we evaluate the predictions on the original testing data then we find a smaller value for the RMSE than we obtained using cross-validation. This means that a simple train-test split would have given an overly optimistic view on model performance.

13. Cross-validate all the models!

Let's give cross-validation a try on our flights model.

### Cross validating simple flight duration model
You've already built a few models for predicting flight duration and evaluated them with a simple train/test split. However, cross-validation provides a much better way to evaluate model performance.

In this exercise you're going to train a simple model for flight duration using cross-validation. Travel time is usually strongly correlated with distance, so using the `km` column alone should give a decent model.

The data have been randomly split into `flights_train` and `flights_test`.

The following classes have already been imported: `LinearRegression`, `RegressionEvaluator`, `ParamGridBuilder` and `CrossValidator`.

**Instructions**

- Create an empty parameter grid.
- Create objects for building and evaluating a linear regression model. The model should predict the `"duration"` field.
- Create a cross-validator object. Provide values for the `estimator`, `estimatorParamMaps` and `evaluator` arguments. Choose 5-fold cross validation.
- Train and test the model across multiple folds of the training data.

In [62]:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import RegressionEvaluator

In [63]:
assemb = VectorAssembler(inputCols = ['km'], outputCol = 'features')
flights_train = assemb.transform(flights_train.drop('features'))
flights_test = assemb.transform(flights_test.drop('features'))

In [64]:
# Create an empty parameter grid
params = ParamGridBuilder().build()

# Create objects for building and evaluating a regression model
regression = LinearRegression(labelCol = 'duration')
evaluator = RegressionEvaluator(labelCol = 'duration')

# Create a cross validator
cv = CrossValidator(estimator=regression, estimatorParamMaps=params, evaluator=evaluator, numFolds = 5)

# Train and test model on multiple folds of the training data
cv = cv.fit(flights_train)

# NOTE: Since cross-valdiation builds multiple models, the fit() method can take a little while to complete.

### Cross validating flight duration model pipeline
The cross-validated model that you just built was simple, using km alone to predict duration.

Another important predictor of flight duration is the origin airport. Flights generally take longer to get into the air from busy airports. Let's see if adding this predictor improves the model!

In this exercise you'll add the `org` field to the model. However, since `org` is categorical, there's more work to be done before it can be included: it must first be transformed to an index and then one-hot encoded before being assembled with `km` and used to build the regression model. We'll wrap these operations up in a pipeline.

The following objects have already been created:

- `params` — an empty parameter grid
- `evaluator` — a regression evaluator
- `regression` — a LinearRegression object with labelCol='duration'.

The `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `CrossValidator` classes have already been imported.

**Instructions**

- Create a string indexer. Specify the input and output fields as `org` and `org_idx`.
- Create a one-hot encoder. Name the output field `org_dummy`.
- Assemble the `km` and `org_dummy` fields into a single field called `features`.
- Create a pipeline using the following operations: string indexer, one-hot encoder, assembler and linear regression. Use this to create a cross-validator.

In [65]:
# Create an indexer for the org field
indexer = StringIndexer(inputCol = 'org', outputCol = 'org_idx')

# Create an one-hot encoder for the indexed org field
onehot = OneHotEncoder(inputCols = ['org_idx'], outputCols = ['org_dummy'])

# Assemble the km and one-hot encoded fields
assembler = VectorAssembler(inputCols = ['km', 'org_dummy'], outputCol = 'features')

# Create a pipeline and cross-validator.
pipeline = Pipeline(stages=[indexer, onehot, assembler, regression])
cv = CrossValidator(estimator=pipeline,
          estimatorParamMaps=params,
          evaluator=evaluator)

## Grid Search
1. Grid Search

So far you've been using the default parameters for almost everything. You've built some decent models, but they could probably be improved by choosing better model parameters.

2. Tuning

There is no universal "best" set of parameters for a particular model. The optimal choice of parameters will depend on the data and the modeling goal. The idea is relatively simple, you build a selection of models, one for each set of model parameters. Then you evaluate those models and choose the best one.

3. Cars revisited (again)

You'll be looking at the fuel consumption regression model again.

4. Fuel consumption with intercept

You'll start by doing something simple, comparing a linear regression model with an intercept to one that passes through the origin. By default a linear regression model will always fit an intercept, but you're going to be explicit and specify the fitIntercept parameter as True. You fit the model to the training data and then calculate the RMSE for the testing data.

5. Fuel consumption without intercept

Next you repeat the process, but specify False for the fitIntercept parameter. Now you are creating a model which passes through the origin. When you evaluate this model you find that the RMSE is higher. So, comparing these two models you'd naturally choose the first one because it has a lower RMSE. However, there's a problem with this approach. Just getting a single estimate of RMSE is not very robust. It'd be better to make this comparison using cross-validation. You also have to manually build the models for the two different parameter values. It'd be great if that were automated.

6. Parameter grid

You can systematically evaluate a model across a grid of parameter values using a technique known as grid search. To do this you need to set up a parameter grid. You actually saw this in the previous lesson, where you simply created an empty grid. Now you are going to add points to the grid. First you create a grid builder and then you add one or more grids. At present there's just one grid, which takes two values for the fitIntercept parameter. Call the build() method to construct the grid. A separate model will be built for each point in the grid. You can check how many models this corresponds to and, of course, this is just two.

7. Grid search with cross-validation

Now you create a cross-validator object and fit it to the training data. This builds a bunch of models: one model for each fold and point in the parameter grid. Since there are two points in the grid and ten folds, this translates into twenty models. The cross-validator is going to loop through each of the points in the parameter grid and for each point it will create a cross-validated model using the corresponding parameter values. When you take a look at the average metrics attribute, you can see why the metric is given as a list: you get one average value for each point in the grid. The values confirm what you observed before: the model that includes an intercept is superior to the model without an intercept.

8. The best model & parameters

Our goal was to get the best model for the data. You retrieve this using the appropriately named bestModel attribute. But it's not actually necessary to work with this directly because the cross-validator object will behave like the best model. So, you can use it directly to make predictions on the testing data. Of course, you want to know what the best parameter value is and you can retrieve this using the explainParam() method. As expected the best value for the fitIntercept parameter is True. You can see this after the word "current" in the output.

9. A more complicated grid

It's possible to add more parameters to the grid. Here, in addition to whether or not to include an intercept, you're also considering a selection of values for the regularization parameter and the elastic net parameter. Of course, the more parameters and values you add to the grid, the more models you have to evaluate. Because each of these models will be evaluated using cross-validation, this might take a little while. But it will be time well spent, because the model that you get back will in principle be much better than what you would have obtained by just using the default parameters.

10. Find the best parameters!

Let's apply grid search on the flights and SMS models!

### Optimizing flights linear regression
Up until now you've been using the default hyper-parameters when building your models. In this exercise you'll use cross validation to choose an optimal (or close to optimal) set of model hyper-parameters.

The following have already been created:

- `regression` — a `LinearRegression` object
- `pipeline` — a pipeline with string indexer, one-hot encoder, vector assembler and linear regression and
- `evaluator` — a `RegressionEvaluator` object.

**Instructions**

- Create a parameter grid builder.
- Add grids for with `regression.regParam` (values `0.01`, `0.1`, `1.0`, and `10.0`) and `regression.elasticNetParam` (values `0.0`, `0.5`, and `1.0`).
- Build the grid.
 Create a cross validator, specifying five folds.

In [66]:
# Create parameter grid
params = ParamGridBuilder()

# Add grids for two parameters
params = params.addGrid(regression.regParam, [0.01, 0.1,1.0,10.0]) \
               .addGrid(regression.elasticNetParam, [0.0, 0.5, 1.0])

# Build the parameter grid
params = params.build()
print('Number of models to be tested: ', len(params))

# Create cross-validator
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=params, evaluator=evaluator, numFolds = 5)

Number of models to be tested:  12


### Dissecting the best flight duration model
You just set up a CrossValidator to find good parameters for the linear regression model predicting flight duration.

The model pipeline has multiple stages (objects of type `StringIndexer`, `OneHotEncoder`, `VectorAssembler` and `LinearRegression`), which operate in sequence. The stages are available as the stages attribute on the pipeline object. They are represented by a list and the stages are executed in the sequence in which they appear in the list.

Now you're going to take a closer look at the pipeline, split out the stages and use it to make predictions on the testing data.

The following objects have already been created:

- `cv` — a trained `CrossValidatorModel` object and
- `evaluator` — a `RegressionEvaluator` object.
- The flights data have been randomly split into `flights_train` and `flights_test`.

**Instructions**

- Retrieve the best model.
- Look at the stages in the best model.
- Isolate the linear regression stage and extract its parameters.
- Use the best model to generate predictions on the testing data and calculate the RMSE.

In [67]:
cv = cv.fit(flights_train.drop('features'))

In [68]:
# Get the best model from cross validation
best_model = cv.bestModel

# Look at the stages in the best model
print(best_model.stages)

# Get the parameters for the LinearRegression object in the best model
best_model.stages[3].extractParamMap()

# Generate predictions on testing data using the best model then calculate RMSE
predictions = best_model.transform(flights_test.drop('features'))
print("RMSE =", evaluator.evaluate(predictions))

[StringIndexerModel: uid=StringIndexer_1328fc2dcd3d, handleInvalid=error, OneHotEncoderModel: uid=OneHotEncoder_2363f0554273, dropLast=true, handleInvalid=error, numInputCols=1, numOutputCols=1, VectorAssembler_324580bf5543, LinearRegressionModel: uid=LinearRegression_5e483bc6ded0, numFeatures=8]
RMSE = 11.064032706878418


### SMS spam optimised
The pipeline you built earlier for the SMS spam model used the default parameters for all of the elements in the pipeline. It's very unlikely that these parameters will give a particularly good model though. In this exercise you're going to run the pipeline for a selection of parameter values. We're going to do this in a systematic way: the values for each of the parameters will be laid out on a grid and then pipeline will systematically run across each point in the grid.

In this exercise you'll set up a parameter grid which can be used with cross validation to choose a good set of parameters for the SMS spam classifier.

The following are already defined:

- hasher — a HashingTF object and
- logistic — a LogisticRegression object.

**Instructions**

- Create a parameter grid builder object.
- Add grid points for `numFeatures` and `binary` parameters to the `HashingTF` object, giving values 1024, 4096 and 16384, and True and False, respectively.
- Add grid points for `regParam` and `elasticNetParam` parameters to the `LogisticRegression` object, giving values of 0.01, 0.1, 1.0 and 10.0, and 0.0, 0.5, and 1.0 respectively.
- Build the parameter grid.

In [69]:
# Create parameter grid
params = ParamGridBuilder()

# Add grid for hashing trick parameters
params = params.addGrid(hasher.numFeatures, [1024, 4096, 16384]) \
               .addGrid(hasher.binary, [True, False])

# Add grid for logistic regression parameters
params = params.addGrid(logistic.regParam, [0.01, 0.1, 1.0, 10.0]) \
               .addGrid(logistic.elasticNetParam, [0.0, 0.5, 1.0])

# Build parameter grid
params = params.build()

### How many models for grid search?
How many models will be built when the cross-validator below is fit to data?

`params = ParamGridBuilder().addGrid(hasher.numFeatures, [1024, 4096, 16384]) \
                           .addGrid(hasher.binary, [True, False]) \
                           .addGrid(logistic.regParam, [0.01, 0.1, 1.0, 10.0]) \
                           .addGrid(logistic.elasticNetParam, [0.0, 0.5, 1.0]) \
                           .build()

cv = CrossValidator(..., estimatorParamMaps=params, numFolds=5)`

Possible Answers

- 3

- 5

- 72

- **360**

## Ensemble
1. Ensemble

You now know how to choose a good set of parameters for any model using cross-validation and grid search. In the final lesson you're going to learn about how models can be combined to form a collection or "ensemble" which is more powerful than each of the individual models alone.

2. What's an ensemble?

Simply put, an ensemble model is just a collection of models. An ensemble model combines the results from multiple models to produce better predictions than any one of those models acting alone. The concept is based on the idea of the "Wisdom of the Crowd", which implies that the aggregated opinion of a group is better than the opinions of the individuals in that group, even if the individuals are experts.

3. Ensemble diversity

As the quote suggests, for this idea to be true, there must be diversity and independence in the crowd. This applies to models too: a successful ensemble requires diverse models. It does not help if all of the models in the ensemble are similar or exactly the same. Ideally each of the models in the ensemble should be different.

4. Random Forest

A Random Forest, as the name implies, is a collection of trees. To ensure that each of those trees is different, the Decision Tree algorithm is modified slightly: - each tree is trained on a different random subset of the data and - within each tree a random subset of features is used for splitting at each node. The result is a collection of trees where no two trees are the same. Within the Random Forest model, all of the trees operate in parallel.

5. Create a forest of trees

Let's go back to the cars classifier yet again. You create a Random Forest model using the RandomForestClassifier class from the classification sub-module. You can select the number of trees in the forest using the numTrees parameter. By default this is twenty, but we'll drop that to five so that the results are easier to interpret. As is the case with any other model, the Random Forest is fit to the training data.

6. Seeing the trees

Once the model is trained it's possible to access the individual trees in the forest using the trees attribute. You would not normally do this, but it's useful for illustrative purposes. There are precisely five trees in the forest, as specified. The trees are all different, as can be seen from the varying number of nodes in each tree. You can then make predictions using each tree individually.

7. Predictions from individual trees

Here are the predictions of individual trees on a subset of the testing data. Each row represents predictions from each of the five trees for a specific record. In some cases all of the trees agree, but there is often some dissent amongst the models. This is precisely where the Random Forest works best: where the prediction is not clear cut. The Random Forest model creates a consensus prediction by aggregating the predictions across all of the individual trees.

8. Consensus predictions

You don't need to worry about these details though because the transform() method will automatically generate a consensus prediction column. It also creates a probability column which assigns aggregate probabilities to each of the outcomes.

9. Feature importances

It's possible to get an idea of the relative importance of the features in the model by looking at the featureImportances attribute. An importance is assigned to each feature, where a larger importance indicates a feature which makes a larger contribution to the model. Looking carefully at the importances we see that feature 4 (rpm) is the most important, while feature 0 (the number of cylinders) is the least important.

10. Gradient-Boosted Trees

The second ensemble model you'll be looking at is Gradient-Boosted Trees. Again the aim is to build a collection of diverse models, but the approach is slightly different. Rather than building a set of trees that operate in parallel, now we build trees which work in series. The boosting algorithm works iteratively. First build a decision tree and add to the ensemble. Then use the ensemble to make predictions on the training data. Compare the predicted labels to the known labels. Now identify training instances where predictions were incorrect. Return to the start and train another tree which focuses on improving the incorrect predictions. As trees are added to the ensemble its predictions improve because each new tree focuses on correcting the shortcomings of the preceding trees.

11. Boosting trees

The class for the Gradient-Boosted Tree classifier is also found in the classification sub-module. After creating an instance of the class you fit it to the training data.

12. Comparing trees

You can make an objective comparison between a plain Decision Tree and the two ensemble models by looking at the values of AUC obtained by each of them on the testing data. Both of the ensemble methods score better than the Decision Tree. This is not too surprising since they are significantly more powerful models. It's also worth noting that these results are based on the default parameters for these models. It should be possible to get even better performance by tuning those parameters using cross-validation.

13. Ensemble all of the models!

In the final set of exercises you'll try out ensemble methods on the flights data.

### Delayed flights with Gradient-Boosted Trees
You've previously built a classifier for flights likely to be delayed using a Decision Tree. In this exercise you'll compare a Decision Tree model to a Gradient-Boosted Trees model.

The `flights` data have been randomly split into `flights_train` and `flights_test`.

**Instructions**

- Import the classes required to create Decision Tree and Gradient-Boosted Tree classifiers.
- Create Decision Tree and Gradient-Boosted Tree classifiers. Train on the training data.
- Create an evaluator and calculate AUC on testing data for both classifiers. Which model performs better?
- For the Gradient-Boosted Tree classifier print the number of trees and the relative importance of features.

In [70]:
# Import the classes required
from pyspark.ml.classification import DecisionTreeClassifier, GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Create model objects and train on training data
tree = DecisionTreeClassifier().fit(flights_train)
gbt = GBTClassifier().fit(flights_train)

# Compare AUC on testing data
evaluator = BinaryClassificationEvaluator()
print(evaluator.evaluate(tree.transform(flights_test)))
print(evaluator.evaluate(gbt.transform(flights_test)))

# Find the number of trees and the relative importance of features
print(gbt.trees)
print(gbt.featureImportances)

0.5246650705347962
0.565571990782101
[DecisionTreeRegressionModel: uid=dtr_a0b0f3da27b3, depth=5, numNodes=39, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_9e1946100902, depth=5, numNodes=27, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_b6afe141f2a5, depth=5, numNodes=27, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_a06f06f745a7, depth=5, numNodes=27, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_e04254236560, depth=5, numNodes=29, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_9f73b7dc7ce7, depth=5, numNodes=29, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_0ef0d6eceb0c, depth=5, numNodes=29, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_a1a004d77c7a, depth=5, numNodes=29, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_0f827249ea80, depth=5, numNodes=29, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_4a5487e0eec5, depth=5, numNodes=29, numFeatures=1, DecisionTreeRegressionModel: uid=dtr_1a6c2a8fec82, depth=5, numNodes=29, numFeatur

### Delayed flights with a Random Forest
In this exercise you'll bring together cross validation and ensemble methods. You'll be training a Random Forest classifier to predict delayed flights, using cross validation to choose the best values for model parameters.

You'll find good values for the following parameters:

- `featureSubsetStrategy` — the number of features to consider for splitting at each node and
- `maxDepth` — the maximum number of splits along any branch.

Unfortunately building this model takes too long, so we won't be running the `.fit()` method on the pipeline.

**Instructions**

- Create a random forest classifier object.
- Create a parameter grid builder object. Add grid points for the `featureSubsetStrategy` and `maxDepth` parameters.
- Create binary classification evaluator.
- Create a cross-validator object, specifying the estimator, parameter grid and evaluator. Choose 5-fold cross validation.

In [71]:
from pyspark.ml.classification import RandomForestClassifier

In [75]:
# Create a random forest classifier
forest = RandomForestClassifier()

# Create a parameter grid
params = ParamGridBuilder() \
            .addGrid(forest.featureSubsetStrategy, ['all', 'onethird', 'sqrt', 'log2']) \
            .addGrid(forest.maxDepth, [2, 5, 10]) \
            .build()

# Create a binary classification evaluator
evaluator = BinaryClassificationEvaluator()

# Create a cross-validator
cv = CrossValidator(estimator = forest, evaluator = evaluator, estimatorParamMaps=params,  numFolds = 5)

### Evaluating Random Forest
In this final exercise you'll be evaluating the results of cross-validation on a Random Forest model.

The following have already been created:

- `cv` - a cross-validator which has already been fit to the training data
- `evaluator` — a `BinaryClassificationEvaluator` object and
- `flights_test` — the testing data.

**Instructions**

- Print a list of average AUC metrics across all models in the parameter grid.
- Display the average AUC for the best model. This will be the largest AUC in the list.
- Print an explanation of the `maxDepth` and `featureSubsetStrategy` parameters for the best model.
- Display the AUC for the best model predictions on the testing data.

In [79]:
cv = cv.fit(flights_train)

In [80]:
# Average AUC for each parameter combination in grid
print(cv.avgMetrics)

# Average AUC for the best model
print(max(cv.avgMetrics))

# What's the optimal parameter value for maxDepth?
print(cv.bestModel.explainParam('maxDepth'))
# What's the optimal parameter value for featureSubsetStrategy?
print(cv.bestModel.explainParam('featureSubsetStrategy'))

# AUC for best model on testing data
print(evaluator.evaluate(cv.bestModel.transform(flights_test)))

[0.5589995799344675, 0.568171669034667, 0.5675617404647673, 0.5589995799344675, 0.568171669034667, 0.5675617404647673, 0.5589995799344675, 0.568171669034667, 0.5675617404647673, 0.5589995799344675, 0.568171669034667, 0.5675617404647673]
0.568171669034667
maxDepth: Maximum depth of the tree. (>= 0) E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes. Must be in range [0, 30]. (default: 5, current: 5)
featureSubsetStrategy: The number of features to consider for splits at each tree node. Supported options: 'auto' (choose automatically for task: If numTrees == 1, set to 'all'. If numTrees > 1 (forest), set to 'sqrt' for classification and to 'onethird' for regression), 'all' (use all features), 'onethird' (use 1/3 of the features), 'sqrt' (use sqrt(number of features)), 'log2' (use log2(number of features)), 'n' (when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features). default = 'auto' (default:

# Closing thoughts

1. Closing thoughts

Congratulations on completing this course on Machine Learning with Apache Spark. You have covered a lot of ground, reviewing some Machine Learning fundamentals and seeing how they can be applied to large datasets, using Spark for distributed computing.

2. Things you've learned

You learned how to load data into Spark and then perform a variety of operations on those data. Specifically, you learned basic column manipulation on DataFrames, how to deal with text data, bucketing continuous data and one-hot encoding categorical data. You then delved into two types of classifiers, Decision Trees and Logistic Regression, in the process building a robust spam classifier. You also learned about partitioning your data and how to use testing data and a selection of metrics to evaluate a model. Next you learned about regression, starting with a simple linear regression model and progressing to penalized regression, which allowed you to build a model using only the most relevant predictors. You learned about pipelines and how they can make your Spark code cleaner and easier to maintain. This led naturally into using cross-validation and grid search to derive more robust model metrics and use them to select good model parameters. Finally you encountered two forms of ensemble models.

3. Learning more

Of course, there are many topics that were not covered in this course. If you want to dig deeper then consult the excellent and extensive online documentation. Importantly you can find instructions for setting up and securing a Spark cluster.

4. Congratulations!

Now go and use what you've learned to solve challenging and interesting big data problems in the real world!