# Purpose of this course

Throughout this course, we’ll cover different cloud environments and tools for building scalable data and model pipelines. The goal is to provide readers with the opportunity to get hands-on experience and start building with a number of different tools. Since this course is targeted at analytics practitioners with prior Python experience, we’ll walk through examples from start to finish but won’t dig into the details of the programming language itself.

## Role of data science

The role of data science is constantly transforming and adding new specializations. Data scientists that build production-grade services are often called applied scientists. Their goal is to build systems that are scalable and robust. To be scalable, we need to use tools that can parallelize and distribute code.

Parallelizing the code means that we can perform multiple tasks simultaneously, and distributing the code means that we can scale up the number of machines needed to accomplish a task. Robust services are systems that are resilient and can recover from failure. While the focus of this course is on scalability rather than robustness, we will cover monitoring systems in production and discuss measuring a model’s performance over time.
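To make the idea of performing multiple tasks simultaneously concrete, here is a minimal single-machine sketch using only the standard library. The `score_batch` function and the batch IDs are hypothetical stand-ins, and the sleep call simulates I/O-bound work such as a database call. Threads are used here only to illustrate the idea; distributing work across machines requires tools such as PySpark, which later chapters cover.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def score_batch(batch_id):
    """Hypothetical task: pretend to score one batch of records."""
    time.sleep(0.1)  # stand-in for I/O-bound work such as a database call
    return batch_id * 10


batch_ids = [1, 2, 3, 4]

# Sequential: each task waits for the previous one to finish.
start = time.perf_counter()
sequential = [score_batch(b) for b in batch_ids]
sequential_seconds = time.perf_counter() - start

# Parallel: four worker threads handle the tasks at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    parallel = list(pool.map(score_batch, batch_ids))
parallel_seconds = time.perf_counter() - start

print(sequential == parallel)  # both approaches return the same results
print(f"sequential: {sequential_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```

Both runs produce the same results, but the parallel run should finish in roughly the time of a single task because the waits overlap.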

During my career as a data scientist, I’ve worked at a number of video game companies and have had experience putting propensity models, lifetime-value predictions, and recommendation systems into production. Overall, this process has become more streamlined with the development of tools such as PySpark, enabling data scientists to build end-to-end products more rapidly.

While many companies now have engineering teams with machine learning focuses, it’s valuable for data scientists to have broad expertise in productizing models. Owning more of the process means that a data science team can deliver products faster and iterate more rapidly.

## Role of data products
Data products are useful for organizations because they provide personalization for the user base. For example, the recommendation system that I designed for EverQuest Landmark provided curated content for players from a marketplace with thousands of user-created items. The goal of any data product should be to create value for an organization. The recommendation system accomplished this goal by increasing the revenue generated from user-created content. Propensity models, which predict the likelihood of a user to perform an action, can also have a direct impact on core metrics for an organization, by enabling personalized experiences that increase user engagement.

The process used to productize models is usually unique for each organization, due to different cloud environments, databases, and product organizations. However, many of the same tools are used within these workflows, such as SQL and PySpark. Your organization may not be using the same data ecosystem as these examples, but the methods should transfer to your use cases.

## Chapter learning outcomes
In this chapter, we will:

- Introduce the role of applied science and motivate the usage of Python for building data products.
- Discuss different cloud and coding environments for scaling up data science.
- Introduce the data sets and types of models used throughout the course.
- Introduce automated feature engineering as a step to include in data science workflows.

# Applied Data Science
Data science is a broad discipline with many different specializations. One distinction that is becoming common is between product data science and applied data science.

## Product data science
Product data scientists are typically embedded on a product team, such as a game studio, and they provide analysis and modeling that helps the team directly improve the product. For example, a product data scientist might find an issue with the first-time user experience in a game and make recommendations, such as which languages to focus on for localization to improve new user retention.

## Applied data science
Applied data science is at the intersection of machine learning engineering and data science. Applied data scientists focus on building data products that product teams can integrate. For example, an applied scientist at a game publisher might build a recommendation service that different game teams can integrate into their products. Typically, this role is part of a central team that is responsible for owning a data product. A data product is a production system that provides predictive models, such as identifying which items a player is likely to buy.

## Need for applied scientists
Applied scientist is a job title that is growing in usage across tech companies, including Amazon, Facebook, and Microsoft. The need for this type of role is growing because a single applied scientist can provide tremendous value to an organization. For example, instead of having product data scientists build bespoke propensity models for individual games, an applied scientist can build a scalable approach that provides a similar service across a portfolio of games.

At Zynga, one of the data products built by the team of applied scientists was a system called AutoModel, which provides several propensity models for all of their games, including the likelihood for a specific player to churn.

## Tools for applied scientists
There have been a few developments in technology that have made applied science a reality. Tools for automated feature engineering, such as deep learning, and scalable computing environments, such as PySpark, have enabled companies to build large-scale data products with smaller team sizes.

The need to hire engineers for data ingestion and warehousing, data scientists for predictive modeling, and additional engineers for building a machine learning infrastructure is reduced. We can now use managed services in the cloud to enable applied scientists to take on more of the responsibilities previously designated to engineering teams.

One of the goals of this course is to help data scientists make the transition to applied science, by providing hands-on experience with different tools that can be used for scalable computing and standing up services for predictive models. We will work through different tools and cloud environments to build proof of concepts for data products that can translate to production environments.

# Python for Scalable Compute

## Python for data science

Python is quickly becoming the de facto language for data science. In addition to the huge library of packages that provide useful functionalities, one of the reasons that Python is becoming so popular is that it can be used for building scalable data and predictive model pipelines.

## Example of Python modeling

In [None]:
# from sklearn.linear_model import LinearRegression
# from sklearn.model_selection import train_test_split
# import pandas as pd
# import numpy as np

# # load Boston housing data set 
# data_url = "http://lib.stat.cmu.edu/datasets/boston"
# raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# target = raw_df.values[1::2, 2]

# bostonDF = pd.DataFrame(data, columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT'])
# bostonDF['label'] = target

# # create train and test splits of the housing data set 
# x_train, x_test, y_train, y_test = train_test_split(bostonDF.drop(['label'], axis=1), bostonDF['label'], test_size=0.33)

# # train a linear regression model
# model = LinearRegression()
# model.fit(x_train, y_train)

# # print results 
# print("R^2: " + str(model.score(x_test, y_test)))
# print("Mean Error: " + str(sum(abs(y_test - model.predict(x_test)))/y_test.count()))

You can use Python on your local machine and build predictive models with scikit-learn, or you can use environments such as Dataflow and PySpark to build distributed systems. While these different environments use different libraries and programming paradigms, they’re all in the same language of Python.

It’s no longer necessary to translate an R script into a production language such as Java; you can use the same language for both development and production of predictive models. It took me a while to adopt Python as my data science language of choice.

Java had been my preferred language, regardless of the task, since early in my undergraduate career. For data science tasks, I used tools like Weka to train predictive models. I still find Java to be useful when building data pipelines, and it’s great to know for directly collaborating with engineering teams on projects.

I later switched to R while working at Electronic Arts, and found the transition to an interactive coding environment to be quite useful for data science. One of the features I really enjoyed in R is R Markdown, which you can use to write documents with inline code.

## Reasons to learn Python
When I started working at Zynga in 2018, I adopted Python and haven’t looked back. It took a bit of time to get used to the new language, but several reasons convinced me to learn Python:

**Momentum**: Many teams are already using Python for production, or portions of their data pipelines. It makes sense to also use Python for performing analysis tasks.

**PySpark**: R and Java don’t provide a good transition to authoring Spark tasks interactively. You can use Java for Spark, but it’s not a good fit for exploratory work. Additionally, the transition from Python to PySpark seems to be the most approachable way to learn Spark.

**Deep learning**: I’m interested in deep learning, and while there are R bindings for libraries such as Keras, it’s better to code in the native language of these libraries. I previously used R to author custom loss functions, and I had trouble debugging errors.

**Libraries**: In addition to the deep learning libraries offered for Python, there are a number of other useful tools, including Flask and Bokeh. There are also notebook environments that can scale, including Google’s Colaboratory and AWS SageMaker.

## From R to Python
To ease the transition from R to Python, I’d recommend the following steps:

**Focus on outcomes, not semantics**: Instead of learning about all the fundamentals of the language, I first focused on doing in Python what I already knew how to do in other languages, such as training a logistic regression model.

**Learn the ecosystem, not the language**: I didn’t limit myself to the base language when learning. Instead, I jumped right into using Pandas and scikit-learn.

**Use cross-language libraries**: I already had experience with Keras and Plotly in R and used knowledge of these libraries to bootstrap learning Python.

**Work with real-world data**: I used the data sets provided by Google’s BigQuery to test out my scripts on large-scale data.

**Start locally, if possible**: While one of my goals was to learn PySpark, I first focused on getting things up and running on my local machine before moving to cloud ecosystems.

There are many situations where Python is not the best choice for a specific task, but it does have broad applicability when prototyping models and building scalable model pipelines.

Because of Python’s rich ecosystem, we will be using it for all the examples in this course.

# Cloud Environments

## Scalable data science pipeline
To build scalable data science pipelines, it’s necessary to move beyond single machine scripts and move to clusters of machines. While this is possible to do with an on-premise setup, a common trend is using cloud computing environments to achieve large-scale processing. There are a number of different options available, with the top three platforms currently being Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure.

Most cloud platforms offer free credits for getting started. GCP offers a $300 credit for new users to get hands-on with their tools, while AWS provides free-tier access for many services.

In this course, we’ll get hands-on with both AWS and GCP with little to no cost involved.

## Amazon Web Services (AWS)
AWS is currently the largest cloud provider, and this dominance has been demonstrated by the number of gaming companies using the AWS platform. I’ve had experience working with AWS at Electronic Arts, Twitch, and Zynga. This platform has a wide range of tools available, but getting these components to work well together generally takes additional engineering work versus GCP.

With AWS, you can use both self-hosted and managed solutions for building data pipelines. For example, the managed option for messaging on AWS is Kinesis, while the self-hosted option is Kafka. We’ll walk through examples with both of these options in Streaming Model Workflows. There’s typically a tradeoff between cost and DevOps when choosing between self-hosted and managed options.

## Redshift
The default database to use on AWS is Redshift, which is a columnar database. This option works well as a data warehouse, but it doesn’t scale well to data lake volumes. It’s common for organizations to set up a separate data lake and data warehouse on AWS. For example, it’s possible to store data on S3 and use tools such as Athena to provide data lake functionality while using Redshift as a solution for a data warehouse. This approach has worked well in the past, but it creates issues when building large-scale data science pipelines. Moving data in and out of a relational database can be a bottleneck for these types of workflows. One of the solutions to this bottleneck is to use vendor solutions that separate storage from compute, such as Snowflake or Delta Lake.

## Elastic Compute (EC2)

The first component we’ll work with in AWS is Elastic Compute (EC2) instances. These are individual virtual machines that you can spin up and provision for any necessary task. In the next lesson, we’ll show you how to set up an instance that provides a remote Jupyter environment. EC2 instances are great for getting started with tools, such as Flask and Gunicorn, and getting started with Docker. To scale up beyond individual instances, we’ll explore Lambda functions and Elastic Container Services.

## PySpark
To build scalable pipelines on AWS, we’ll focus on PySpark as the main environment. PySpark enables Python code to be distributed across a cluster of machines, and vendors such as Databricks provide managed environments for Spark. Another option available only on AWS is SageMaker, which provides a Jupyter notebook environment for training and deploying models.

## Google Cloud Platform (GCP)
GCP is currently the third-largest cloud platform provider and offers a wide range of managed tools.

GCP is currently used by large media companies such as Spotify and by gaming companies such as King and Niantic. One of the main benefits of using GCP is that many of the components can be wired together using Dataflow, which is a tool for building batch and streaming data pipelines. We’ll create a batch model pipeline with Dataflow in Cloud Data Flow for Batch Modeling and a streaming pipeline in Streaming Model Workflows.

Google Cloud Platform currently offers a smaller set of tools than AWS, but there is feature parity for many common tools, such as PubSub in place of Kinesis and Cloud Functions in place of AWS Lambda. One area where GCP provides an advantage is BigQuery as a database solution. BigQuery separates storage from compute and can scale to both data lake and data warehouse use cases.

## Dataflow with GCP

Dataflow is one of the most powerful tools for data scientists that GCP provides because it empowers a single data scientist to build large-scale pipelines with much less effort than other platforms. It enables building streaming pipelines that connect PubSub for messaging, BigQuery for analytics data stores, and BigTable for application databases. It’s also a managed solution that can autoscale to meet demand. While the original version of Dataflow was specific to GCP, it’s now based on the Apache Beam library which is portable to other platforms.

# Coding Environments

## Options for writing Python code
There’s a variety of options for writing Python code to do data science. The best environment to use likely varies based on what you are building, but notebook environments are becoming more and more common as the place to write Python scripts. The three types of coding environments I’ve worked with for Python are IDEs, text editors, and notebooks.

If you’re used to working with an IDE, tools like PyCharm and Rodeo are useful editors and provide additional tools for debugging versus other options. It’s also possible to write code in text editors such as Sublime and then run scripts via the command line. I find this works well for building web applications with Flask and Dash, where you need to have a long-running script that persists beyond the scope of running a cell in a notebook. I now perform the majority of my data science work in notebook environments, because they cover exploratory analysis and productizing models.

I like to work in coding environments that make it trivial to share code and collaborate on projects. Databricks and Google Colab are two coding environments that provide truly collaborative notebooks, where multiple data scientists can simultaneously work on a script. When using Jupyter notebooks, this level of real-time collaboration is not currently supported. Still, it’s good practice to share notebooks in version control systems such as GitHub for sharing work.

In this course, we’ll only use the text editor and notebook environments for coding. To learn how to build scalable pipelines, I recommend working on a remote machine, such as an EC2 instance, to become more familiar with cloud environments and to build experience setting up Python environments outside of your local machine.

We need a remote machine that we can use for Python scripting. Accomplishing this task requires spinning up an EC2 instance, configuring firewall settings for the EC2 instance, connecting to the instance using SSH, and running a few commands to deploy a Jupyter environment on the machine.

## Setting up an EC2 instance
The first step is to set up an AWS account and log into the AWS management console. AWS provides a free account option with free-tier access to a number of services including EC2. Next, provision a machine using the following steps:

1. Under “Find Services”, search for EC2.
2. Click “Launch Instance”.
3. Select a free-tier Amazon Linux AMI.
4. Click “Review and Launch”, and then “Launch”.
5. Create a key pair and save it to your local machine.
6. Click “Launch Instances”.
7. Click “View Instances”.

The machine may take a few minutes to provision. Once the machine is ready, the instance state is set to “running.” We can now connect to the machine via SSH. One note on the different AMI options is that some of the configurations are set up with Python already installed. However, this course focuses on Python 3, and the included version is often 2.7.

![](imgs/5-1.png)

## Jupyter on EC2
To get experience setting up a remote machine, we’ll start by setting up a Jupyter notebook environment on an EC2 instance in AWS. The result is a remote machine that we can use for Python scripting.

## Connecting Jupyter to your machine
There are two different IPs that you need in order to connect to the machine via SSH, and later connect to the machine via a web browser. The public and private IPs are listed under the “Description” tab as shown in the figure above. To connect to the machine via SSH, we’ll use the Public IP (54.87.230.152). If you are working in a Windows environment, you’ll need an SSH client such as PuTTY. On Linux and macOS, you can use SSH via the command line. To connect to the machine, use the user name “ec2-user” and the key pair generated when launching the instance. An example of connecting to EC2 using the Bitvise client on Windows is shown in the figure below:

![](imgs/5-2.png)

Once you connect to the machine, you can check the Python version by running python --version. On my machine, the result was 2.7.16, meaning that additional setup is needed in order to upgrade to Python 3 and install pip and Jupyter:

In [None]:
# apt-get install python3.9 -y 
# python3 --version
# apt install python3-pip -y 
# pip3 --version
# pip3 install jupyter==1.0.0

The two version commands confirm that the machine is pointing at Python 3 for both Python and pip.

Once Jupyter is installed, we’ll need to set up a firewall restriction so that we can connect directly to the machine on port 8888, where Jupyter runs by default. This approach is the quickest way to get connected to the machine, but it’s advised to use SSH tunneling to connect to the machine rather than a direct connection over the open web. You can open up port 8888 for your local machine by performing the following steps from the EC2 console:

Select your EC2 instance.

Under “Description”, select security groups.

Click “Actions” -> “Edit Inbound Rules”.

Add a new rule: change the port to 8888, select “My IP”.

Click “Save”.
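As an alternative to opening port 8888, the SSH tunneling approach mentioned above can be sketched as follows. This is a minimal example that only builds and prints the `ssh` command for local port forwarding. The key file name `my-key.pem` is a hypothetical placeholder, and the IP is the example Public IP used in this lesson.

```python
import shlex


def build_tunnel_command(key_path, public_ip, user="ec2-user", port=8888):
    """Build an ssh command that forwards a local port to the same port on the instance."""
    return [
        "ssh",
        "-i", key_path,                    # key pair saved when launching the instance
        "-N",                              # do not run a remote command, only forward
        "-L", f"{port}:localhost:{port}",  # local port -> same port on the remote machine
        f"{user}@{public_ip}",
    ]


# "my-key.pem" is a placeholder for your own key pair file.
command = build_tunnel_command("my-key.pem", "54.87.230.152")
print(shlex.join(command))
```

With a tunnel like this running, the browser would connect to port 8888 on localhost instead of the instance’s Public IP, so the inbound rule for port 8888 would not be needed.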

## Running Jupyter on EC2
We can now run and connect to Jupyter on the EC2 machine. To launch Jupyter, run the command shown below while replacing the IP with your EC2 instance’s Private IP. It is necessary to specify the --ip parameter in order to enable remote connections to Jupyter, as incoming traffic will be routed via the private IP.

``jupyter notebook --ip 172.31.53.82``

When you run the jupyter notebook command, you’ll get a URL with a token that can be used to connect to the machine. Before entering the URL into your browser, you’ll need to swap the Private IP output to the console with the Public IP of the EC2 instance, as shown in the snippet below:

In [None]:
# # Original URL
# The Jupyter Notebook is running at:
# http://172.31.53.82:8888/?token=
#     98175f620fd68660d26fa7970509c6c49ec2afc280956a26

# # Swap Private IP with Public IP
# http://54.87.230.152:8888/?token=
#     98175f620fd68660d26fa7970509c6c49ec2afc280956a26

You can now paste the updated URL into your browser to connect to Jupyter on the EC2 machine. The result should be a fresh Jupyter notebook install with a single file, get-pip.py, in the base directory.
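If you restart Jupyter often, the IP swap can be scripted rather than edited by hand. The helper below is a small sketch using the standard library; `swap_host` is a hypothetical name, and the URL and IPs are the example values from this lesson.

```python
from urllib.parse import urlsplit, urlunsplit


def swap_host(url, public_ip):
    """Return the URL with its host replaced by public_ip, keeping port, path, and token."""
    parts = urlsplit(url)
    netloc = public_ip if parts.port is None else f"{public_ip}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


console_url = (
    "http://172.31.53.82:8888/?token="
    "98175f620fd68660d26fa7970509c6c49ec2afc280956a26"
)
print(swap_host(console_url, "54.87.230.152"))
```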

# Introduction to Datasets

## Distributed data sources
To build scalable data pipelines, we’ll need to switch from using local files, such as CSVs, to distributed data sources, such as Parquet files on S3. While the tools used across cloud platforms to load data vary significantly, the end result is usually the same, which is a dataframe. In a single machine environment, we can use Pandas to load the dataframe, while distributed environments use different implementations such as Spark dataframes in PySpark.

In this lesson, we will introduce the datasets that we’ll explore throughout the course. In this chapter, we’ll focus on loading the data using a single machine, while later chapters will present distributed approaches. Although most of the datasets presented here can be downloaded as CSV files and can be read into Pandas using read_csv, it’s good practice to develop automated workflows to connect to diverse data sources.

## Common datasets
We’ll explore the following datasets throughout this course:

- Boston Housing: records of sale prices of homes in the Boston housing market back in 1980
- Game Purchases: a synthetic dataset representing games purchased by different users on Xbox One
- Natality: one of BigQuery’s open datasets on birth statistics in the US over multiple decades
- Kaggle NHL: play-by-play events from professional hockey games and game statistics over the past decade

The first two datasets only need a single command to load them, as long as you have the required libraries installed.

The Natality and Kaggle NHL datasets require setting up authentication files before you programmatically pull the data sources into Pandas.

## Load data from a library
The first approach we’ll use to load a dataset is retrieving it directly from a library. Multiple libraries include the Boston housing dataset because it is a small dataset that is useful for testing out regression models. We’ll load it from scikit-learn by first running pip from the command line:

``pip3 install pandas==1.3.5`` \
``pip3 install "scikit-learn>=1.0.2"``

Once scikit-learn is installed, we can switch back to the Jupyter notebook to explore the dataset. Recent scikit-learn releases no longer ship the load_boston function, so the code snippet below shows how to load the Pandas and NumPy libraries, read the Boston dataset from its original source into a Pandas dataframe, and display the first 5 records:

In [None]:
# import pandas as pd
# import numpy as np

# data_url = "http://lib.stat.cmu.edu/datasets/boston"
# raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# target = raw_df.values[1::2, 2]
# bostonDF = pd.DataFrame(data,columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT'])
# bostonDF['label'] = target
# bostonDF.head()

## Load data from web
The second approach we’ll use to load a dataset is fetching it from the web. The CSV for the Games dataset is available as a single file on GitHub. We can fetch it into a Pandas dataframe by using the read_csv function and passing the URL of the file as a parameter.

In [None]:
# import pandas as pd
# gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
# gamesDF.head()

Both of these approaches are similar to downloading CSV files and reading them from a local directory, but by using these methods, we can avoid the manual step of downloading files.

**Avoiding this manual step is useful for building automated workflows in Python.**

# BigQuery to Pandas

## Workflows automation
One of the ways to automate workflows authored in Python is to directly connect to data sources.

For databases, you can use connectors based on JDBC or native connectors, such as the bigquery module provided by the Google Cloud library. This connector enables Python applications to send queries to BigQuery and load the results as a Pandas dataframe. This process involves setting up a GCP project, installing the prerequisite Python libraries, setting up the Google Cloud command line tools, creating GCP credentials, and finally sending queries to BigQuery programmatically.


## Installing Google Cloud library
The first step is to install the Google Cloud library by running the following commands:

``pip3 install google-cloud-bigquery==3.0.1`` \
``pip3 install matplotlib==3.2.2``

## Setting up gcloud CLI
Next, we’ll need to set up the Google Cloud command line tools in order to set up credentials for connecting to BigQuery. While the files to use will vary based on the current release, here are the steps I ran on the command line:

In [None]:
# apt-get install -y curl apt-transport-https ca-certificates gnupg &&\
# echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list &&\
# curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - &&\
# apt-get update && apt-get install -y google-cloud-sdk

## Setting up credentials 
Once the Google Cloud command line tools are installed, we can set up credentials for connecting to BigQuery:

In [None]:
# gcloud config set project project_name
# gcloud auth login
# gcloud init
# gcloud iam service-accounts create dsdemo 
# gcloud projects add-iam-policy-binding your_project_id \
#         --member "serviceAccount:dsdemo@your_project_id.iam.gserviceaccount.com" --role "roles/owner"
# gcloud iam service-accounts keys \
#        create dsdemo.json --iam-account \
#        dsdemo@your_project_id.iam.gserviceaccount.com
# export GOOGLE_APPLICATION_CREDENTIALS=/home/ec2-user/dsdemo.json

You’ll need to substitute project_name with your project name, your_project_id with your project ID, and dsdemo with your desired service account name. You can find the ID of your project on the GCP console.

The result is a JSON file with credentials for the service account. The export command at the end of this process tells Google Cloud where to find the credentials file.

Setting up credentials for Google Cloud is involved, but generally only needs to be performed once. Now that credentials are configured, it’s possible to directly query BigQuery from a Python script.
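Before sending queries, it can help to confirm that the environment variable points at a usable key file. The check below is a minimal sketch using only the standard library. The function name `check_credentials` is hypothetical, and the fields it looks for are ones that service account key files generally contain, not an exhaustive validation.

```python
import json
import os


def check_credentials(env_var="GOOGLE_APPLICATION_CREDENTIALS"):
    """Return the service account email if the credentials file looks usable."""
    path = os.environ.get(env_var)
    if not path:
        raise RuntimeError(f"{env_var} is not set")
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{env_var} points to a missing file: {path}")
    with open(path) as handle:
        info = json.load(handle)
    # Service account key files include these fields among others.
    for field in ("type", "client_email", "private_key"):
        if field not in info:
            raise ValueError(f"credentials file is missing the '{field}' field")
    return info["client_email"]
```

Calling `check_credentials()` after the export command should return the service account email, or raise an error that describes what is missing.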

## Important note for Jupyter notebook
To directly query BigQuery from a Python script in a Jupyter notebook, run the following statements to set the GOOGLE_APPLICATION_CREDENTIALS environment variable:

In [None]:
# import os
# os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'creds.json'

The snippet below shows how to load a BigQuery client, send a SQL query to retrieve 10 rows from the Natality data set and pull the results into a Pandas dataframe.

In [None]:
# from google.cloud import bigquery
# client = bigquery.Client()
# sql = """
#   SELECT * 
#   FROM  `bigquery-public-data.samples.natality`
#   limit 10
# """

# natalityDF = client.query(sql).to_dataframe()
# natalityDF.head()

# Kaggle to Pandas

## What is Kaggle?
Kaggle is a data science website that provides thousands of open datasets to explore. While it is not possible to pull Kaggle datasets directly into Pandas dataframes, we can use the Kaggle library to programmatically download CSV files as part of an automated workflow.

## Fetching data from Kaggle
The quickest way to get set up with this approach is to create an account on Kaggle. Next, go to the account tab of your profile and select “Create API Token.” Download and open the file.

Add these credentials in the keys below as we will use them throughout the course to fetch data from Kaggle. Run the code to make sure the credentials are successfully saved in **kaggle.json.**

Run vi .kaggle/kaggle.json on your EC2 instance to copy over the contents to your remote machine. The result is a credential file you can use to programmatically download datasets.
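As an alternative to editing the file by hand, the credential file can be written from Python. This is a sketch under a few assumptions: the function name `write_kaggle_credentials` is hypothetical, the username and key values are placeholders for the ones in your downloaded token, and the demo writes to a temporary directory so that it does not touch a real `~/.kaggle` folder.

```python
import json
import os
import tempfile
from pathlib import Path


def write_kaggle_credentials(username, key, directory=None):
    """Write kaggle.json with owner-only permissions and return its path."""
    directory = Path(directory) if directory else Path.home() / ".kaggle"
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / "kaggle.json"
    path.write_text(json.dumps({"username": username, "key": key}))
    os.chmod(path, 0o600)  # readable and writable by the owner only
    return path


# Demo in a temporary directory; omit `directory` to write to ~/.kaggle instead.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = write_kaggle_credentials("your_username", "your_api_key", directory=tmp)
    print(json.loads(demo_path.read_text())["username"])
```

Restricting the file to the owner is a reasonable default because the key grants access to your Kaggle account.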

## Exploring Kaggle dataset
We’ll explore the NHL (hockey) dataset by running the following commands:


In [None]:
# pip3 install kaggle==1.5.12

# kaggle datasets download martinellis/nhl-game-data
# unzip nhl-game-data.zip
# chmod 0600 *.csv

These commands download the dataset, unzip the files into the current directory, and set the file permissions so that the owner can read and write the files. Now that the files are downloaded on the EC2 instance, we can load and display the Game dataset, as shown in the snippet below:

In [None]:
# import pandas as pd
# nhlDF = pd.read_csv('game.csv') 
# nhlDF.head()

# Prototype Models

## Predictive models
Machine learning is one of the most important steps in the pipeline of a data product. We can use predictive models to identify which users are most likely to purchase an item, or which users are most likely to stop using a product. The goal of the following lessons is to present simple versions of predictive models that we’ll later scale up in more complex pipelines. This course will not focus on state-of-the-art models, but instead cover tools that can be applied to a variety of different machine learning algorithms.

## Machine learning model
Libraries to use
The libraries for implementing different models will vary based on the cloud platform and execution environment being used to deploy a model.

The regression models presented in the following lessons are built with scikit-learn, while the models we’ll build out with PySpark use MLlib.

# Linear Regression

Regression is a common task for supervised learning, such as predicting the value of a home, and linear regression is a useful algorithm for making predictions for these types of problems.

scikit-learn provides both linear and logistic regression models for making predictions. We’ll start by using the LinearRegression class in scikit-learn to predict home prices for the Boston housing dataset.

## Linear regression example
The code snippet below shows how to split the Boston dataset into training and testing datasets with separate data (``x_train``) and label (``y_train``) objects, create and fit a linear regression model, and calculate error metrics on the test dataset.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

data_url = "http://lib.stat.cmu.edu/datasets/boston"
raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
target = raw_df.values[1::2, 2]
bostonDF = pd.DataFrame(data,columns=['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT'])
bostonDF['label'] = target

x_train, x_test, y_train, y_test = train_test_split(
  bostonDF.drop(['label'],axis=1),bostonDF['label'],test_size=0.3)

model = LinearRegression()
model.fit(x_train, y_train)

print("R^2: " + str(model.score(x_test, y_test)))
print("Mean Absolute Error: " + str(sum(
      abs(y_test - model.predict(x_test)))/y_test.count()))

The ``train_test_split`` function splits the dataset into 70% training and 30% holdout datasets. The first parameter is the data attributes from the Boston dataframe, with the label dropped, and the second parameter is the labels from the dataframe. The two commands at the end of the script calculate the R-squared value (the coefficient of determination) and the mean absolute error, which is the mean absolute difference between predicted and actual home prices.
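To make the two error metrics concrete, here is a minimal sketch that computes them by hand in plain Python. The function names and the four sample prices are made up for illustration; the R-squared formula is the standard coefficient of determination, 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical home prices (in thousands) and model predictions.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.5, 7.0, 10.0]

print("R^2: " + str(round(r_squared(actual, predicted), 3)))
print("Mean Absolute Error: " + str(mean_absolute_error(actual, predicted)))
```

An R-squared value close to 1 means the predictions explain most of the variance in the labels, while the mean absolute error is expressed in the same units as the label.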

We now have a simple model that we can productize in a number of different environments. In later lessons and chapters, we’ll explore methods for scaling features, supporting more complex regression models, and automating feature generation.

# Logistic Regression

Logistic regression is a supervised classification algorithm that is useful for predicting which users are likely to perform an action, such as purchasing a product. Using scikit-learn, the process is similar to fitting a linear regression model.

## Example
The main differences from the prior script are the dataset being used, the model object instantiated (``LogisticRegression``), and the ``predict_proba`` function used to calculate error metrics. This function predicts a probability in the continuous range of [0, 1] rather than a specific label.

The snippet below predicts which users are likely to purchase a specific game based on prior games already purchased:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import pandas as pd

# Games data set 
gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")

x_train, x_test, y_train, y_test = train_test_split(
    gamesDF.drop(['label'],axis=1),gamesDF['label'],test_size=0.3,random_state=42)

model = LogisticRegression()
model.fit(x_train, y_train)

print("Accuracy: " + str(model.score(x_test, y_test)))
print("ROC: " + str(roc_auc_score(y_test, 
                    model.predict_proba(x_test)[:, 1] )))

## Performance of the model
The output of this script is two metrics that describe the performance of the model on the holdout dataset. The accuracy metric is the number of correct predictions divided by the total number of predictions. The ROC metric, computed here as the area under the ROC curve (AUC), summarizes how well the model ranks positive outcomes above negative ones across all possible classification thresholds.

Receiver Operating Characteristic (ROC) is a useful metric to use when the different classes being predicted are imbalanced, with noticeably different sizes. Since most players are unlikely to buy a specific game, ROC is a good metric to utilize for this use case. Running this script, I get an accuracy of 87% and an ROC score of 0.76.
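The difference between accuracy and ROC on imbalanced data can be sketched in plain Python. The helper functions and the ten made-up players below are assumptions for illustration; the AUC is computed as the fraction of (buyer, non-buyer) pairs in which the buyer receives the higher score, with ties counted as half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of correct predictions after thresholding the scores."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Ten hypothetical players, only two of whom bought the game.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
constant_scores = [0.1] * 10  # model that gives everyone the same score
ranked_scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.35, 0.45]

print(accuracy(labels, constant_scores), roc_auc(labels, constant_scores))
print(accuracy(labels, ranked_scores), roc_auc(labels, ranked_scores))
```

Both sets of scores give the same accuracy, because neither predicts a purchase at the 0.5 threshold, but only the second one ranks buyers above most non-buyers, which is what the ROC metric rewards.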

Linear and logistic regression models with scikit-learn are good starting points for many machine learning projects. We’ll explore more complex models in this course, but one of the general strategies I take as a data scientist is to quickly deliver a proof of concept, and then iterate and improve a model once it is shown to provide value to an organization.

# Keras Regression

## Deep learning
While I generally recommend starting with simple approaches when building model pipelines, deep learning is becoming a popular tool for data scientists to apply to new problems.

It’s great to explore this capability when tackling new problems, but scaling up deep learning in data science pipelines presents a new set of challenges. For example, PySpark does not currently have a native way of distributing the model application phase to big data. Fortunately, there are plenty of books for getting started with deep learning in Python, such as Deep Learning with Python (Chollet, 2017).

In this lesson, we’ll repeat the same task from the prior lessons, which is predicting which users are likely to buy a game based on their prior purchases. Instead of using a shallow learning approach to predict propensity scores, we’ll use the Keras framework to build a neural network for predicting this outcome.

## Installing Keras
Keras is a general framework for working with deep learning implementations. We can install these dependencies from the command line:

``pip3 install tensorflow==2.8`` \
``pip3 install keras==2.8.0``

## Verifying installation
This process can take a while to complete, and based on your environment, you may run into installation issues. It’s recommended to verify that the installation worked by checking your Keras version in a Jupyter notebook:

In [None]:
import tensorflow as tf
import keras
from keras import models, layers
import matplotlib.pyplot as plt
print(keras.__version__)

## Building models
The general process for building models with Keras is to set up the structure of the model, compile the model, fit the model, and evaluate the model.

## Setting up the model
We’ll start with a simple model to provide a baseline, as shown in the snippet below. This code creates a network with an input layer, a dropout layer, a hidden layer, and an output layer. The input to the model is 10 binary variables that describe prior games purchased, and the output is a prediction of the likelihood to purchase a specified game.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import matplotlib.pyplot as plt

gamesDF = pd.read_csv("https://github.com/bgweber/Twitch/raw/master/Recommendations/games-expand.csv")
# split the data into training and holdout datasets
x_train, x_test, y_train, y_test = train_test_split(
  gamesDF.drop(['label'], axis=1),gamesDF['label'],test_size=0.3)


# define the network structure 
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(64, activation='relu', input_shape=(10,)))
model.add(tf.keras.layers.Dropout(0.1))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))


# compile and fit the model    
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=[tf.keras.metrics.AUC(name='auc')])
history = model.fit(x_train, y_train, epochs=100, batch_size=100, 
                 validation_split = .2, verbose=0)

Since the goal is to identify the likelihood of a player to purchase a game, ROC is a good metric to use to evaluate the performance of a model.

## Optimizing the model
Next, we specify how to optimize the model. We’ll use adam for the optimizer and binary_crossentropy for the loss function. The last step is to train the model. The code snippet shows how to fit the model using the training dataset, 100 training epochs with a batch size of 100, and a validation split of 20%. This process can take a while to run if you increase the number of epochs or decrease the batch size. The validation set is taken only from the training dataset.
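For intuition about the loss function, here is a minimal sketch of binary cross-entropy in plain Python. The function name and the `eps` clipping value are assumptions for illustration and are not taken from the Keras source; the formula itself is the standard average negative log-likelihood for binary labels.

```python
import math


def binary_crossentropy(labels, probabilities, eps=1e-7):
    """Average negative log-likelihood of binary labels."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip predictions away from 0 and 1 to avoid log(0).
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# One buyer (label 1) and one non-buyer (label 0).
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4), round(unsure, 4), round(confident_wrong, 4))
```

The loss is smallest when the predicted probabilities are confident and correct, and it grows quickly when the model is confident and wrong, which is the behavior the optimizer works to reduce during training.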

The result of this process is a history object that tracks the loss and metrics on the training and validation datasets. The code snippet below shows how to plot these values using Matplotlib:

In [None]:
auc = history.history['auc']
val_auc = history.history['val_auc']
epochs = range(1, len(auc) + 1)

plt.figure(figsize=(10, 6))
plt.plot(epochs, auc, 'bo', label='Training AUC')
plt.plot(epochs, val_auc, 'b', label='Validation AUC')
plt.legend()
plt.show()

To compare this approach with the logistic regression results, we’ll evaluate the performance of the model on the holdout dataset.

## Measuring performance
To measure the performance of the model on the test dataset, we can use the evaluate function to measure the ROC metric. The code snippet below shows how to perform this task on our test dataset, which results in an ROC AUC value of 0.80.

In [None]:
results = model.evaluate(x_test, y_test, verbose = 0)
print("ROC: " + str(results[1]))

With an AUC value of 0.80, this is noticeably better than the logistic regression model’s 0.76, but other shallow learning methods such as random forests or XGBoost would likely perform much better on this task.

# Automated Feature Engineering

## Overview
Automated feature engineering is a powerful tool for reducing the amount of manual work needed in order to build predictive models. Instead of a data scientist spending days or weeks coming up with the best features to describe a dataset, we can use tools that approximate this process. One library I’ve been working with to implement this step is FeatureTools. It takes inspiration from the automated feature engineering process in deep learning. However, it is meant for shallow learning problems where you already have structured data but need to translate multiple tables into a single record per user.

## Installing libraries
The library can be installed as follows:

In [None]:
sudo yum install gcc
sudo yum install python3-devel
pip3 install framequery
pip3 install fsspec
pip3 install featuretools==1.8.0

In addition to this library, I loaded the framequery library, which enables writing SQL queries against dataframes. Using SQL to work with dataframes versus specific interfaces, such as Pandas, is useful when translating between different execution environments.

## Getting started
The task we’ll apply the FeatureTools library to is predicting which games in the Kaggle NHL dataset are postseason games. We’ll make this prediction based on summarizations of the play events that are recorded for each game. Since there can be hundreds of play events per game, we need a process for aggregating these into a single summary per game.

Once we aggregate these events into a single game record, we can apply a logistic regression model to predict whether the game is regular or postseason.

## Loading datasets
The first step we’ll perform is loading the datasets and performing some data preparation, as shown below. After loading the datasets as Pandas dataframes, we drop a few attributes from the plays object, and fill any missing attributes with 0.

In [None]:
import pandas as pd

game_df = pd.read_csv("game.csv")
plays_df = pd.read_csv("game_plays.csv")

plays_df = plays_df.drop(['secondaryType', 'periodType', 
                 'dateTime', 'rink_side'], axis=1).fillna(0)

To translate the play events into a game summary, we’ll first 1-hot encode two of the attributes in the plays dataframe, and then perform deep feature synthesis. The code snippet below shows how to perform the first step, and it uses FeatureTools to accomplish this task. The result is quite similar to using the get_dummies function in Pandas, but this approach requires some additional steps.

The base representation in FeatureTools is an EntitySet, which describes a set of tables and the relationships between them, similar to defining foreign key constraints. To use the encode_features function, we need to first translate the plays dataframe into an entity. We can create an EntitySet directly from the plays_df object, but we also need to specify which attributes should be handled as categorical, using the variable_types dictionary parameter.

In [None]:
import featuretools as ft
from featuretools import Feature 

es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays",dataframe=plays_df
           ,index="play_id", variable_types = { 
              "event": ft.variable_types.Categorical, 
               "description": ft.variable_types.Categorical })       

f1 = Feature(es["plays"]["event"])
f2 = Feature(es["plays"]["description"])

encoded, defs = ft.encode_features(plays_df, [f1, f2], top_n=10)
encoded.reset_index(inplace=True)
encoded.head()

Next, we pass a list of features to the encode_features function, which returns a new dataframe with the dummy variables and a defs object that describes how to translate an input dataframe into the 1-hot encoded format. For pipelines, where we need to apply transformations to new datasets, we’ll store a copy of the defs object for later use. We will use pipelines later on in this course.
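For comparison, the same idea of keeping only the most frequent categories can be approximated with the Pandas get_dummies function. This is a rough sketch rather than what FeatureTools does internally: the helper name `one_hot_top_n` and the tiny plays table are made up, and unlike encode_features it does not return a reusable definitions object.

```python
import pandas as pd


def one_hot_top_n(df, column, top_n):
    """1-hot encode the top_n most frequent values of a column."""
    top_values = df[column].value_counts().head(top_n).index
    # Values outside the top_n become missing and get all-zero dummies.
    kept = df[column].where(df[column].isin(top_values))
    dummies = pd.get_dummies(kept, prefix=column, dtype=int)
    return pd.concat([df.drop(columns=[column]), dummies], axis=1)


# Hypothetical play events; "Goal" is the least frequent value.
plays = pd.DataFrame({
    "play_id": [1, 2, 3, 4, 5, 6],
    "event": ["Shot", "Shot", "Hit", "Goal", "Shot", "Hit"],
})

encoded_plays = one_hot_top_n(plays, "event", top_n=2)
print(encoded_plays)
```

With `top_n=2`, only the two most common event types get their own column, and the row for the rare event has zeros in both.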

The next step is to aggregate the hundreds of play events per game into single-game summaries, where the resulting dataframe has a single row per game. To accomplish this task, we’ll recreate the EntitySet from the prior step, but use the 1-hot encoded dataframe as the input. Next, we use the normalize_entity function to describe games as a parent object to plays events, where all plays with the same game_id are grouped together.

The last step is to use the dfs function to perform deep feature synthesis. DFS applies aggregate calculations, such as SUM and MAX, across the different features in the child dataframe in order to collapse hundreds of records into a single row.

In [None]:
es = ft.EntitySet(id="plays")
es = es.entity_from_dataframe(entity_id="plays", 
                      dataframe=encoded, index="play_id")
es = es.normalize_entity(base_entity_id="plays",
                      new_entity_id="games", index="game_id")

features,transform=ft.dfs(entityset=es,
                        target_entity="games",max_depth=2)
features.reset_index(inplace=True)
features.head()

The shape of the sampled dataframe, 5 rows by 212 columns, indicates that we have generated hundreds of features to describe each game using deep feature synthesis. Instead of hand coding this translation, we utilized the FeatureTools library to automate this process.
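As a rough illustration of the kind of aggregation that deep feature synthesis automates, the sketch below collapses play-level rows into one row per game with a Pandas groupby. The sample data and the column naming scheme are assumptions for illustration, and the real dfs output covers more primitives and uses its own feature names.

```python
import pandas as pd

# Hypothetical 1-hot encoded play events for two games.
plays = pd.DataFrame({
    "game_id": [100, 100, 100, 200, 200],
    "event_Shot": [1, 0, 1, 1, 0],
    "event_Goal": [0, 1, 0, 0, 0],
    "period": [1, 2, 3, 1, 2],
})

# Apply several aggregate functions to every child column per game.
summary = plays.groupby("game_id").agg(["sum", "max", "mean"])

# Flatten the (column, function) pairs into single column names.
summary.columns = [f"{func.upper()}({col})" for col, func in summary.columns]
summary = summary.reset_index()

print(summary.shape)
print(summary[["game_id", "SUM(event_Shot)", "MAX(period)"]])
```

Three input columns and three aggregate functions already produce nine features per game, which shows how quickly the feature count grows when the same step is applied to dozens of encoded columns.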

## Using logistic regression 
Now that we have hundreds of features for describing a game, we can use logistic regression to make predictions about the games. For this task, we want to predict whether a game is regular season or postseason, where type = 'P'. The code snippet below shows how to use the framequery library to combine the generated features with the initially loaded games dataframe using a SQL join:

In [None]:
import framequery as fq

# assign labels to the generated features
features = fq.execute("""
  SELECT f.*
    ,case when g.type = 'P' then 1 else 0 end as label
  FROM features f 
  JOIN game_df g
    on f.game_id = g.game_id
""")

# We use the type attribute to assign a label, and then we return all of the generated features and the label.
# The result is a dataframe that we can pass to scikit-learn.
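To see what the join and the CASE expression produce, here is a small self-contained sketch that runs the same query shape against an in-memory SQLite database from the Python standard library. The table contents are made up, and SQLite stands in for framequery only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (game_id INTEGER, shots INTEGER)")
conn.execute("CREATE TABLE game_df (game_id INTEGER, type TEXT)")

# Hypothetical rows: game 2 is a postseason ('P') game.
conn.executemany("INSERT INTO features VALUES (?, ?)",
                 [(1, 30), (2, 25), (3, 41)])
conn.executemany("INSERT INTO game_df VALUES (?, ?)",
                 [(1, "R"), (2, "P"), (3, "R")])

# Same join as above, with an ORDER BY added for a stable result.
rows = conn.execute("""
  SELECT f.*
    ,case when g.type = 'P' then 1 else 0 end as label
  FROM features f
  JOIN game_df g
    on f.game_id = g.game_id
  ORDER BY f.game_id
""").fetchall()
conn.close()

print(rows)  # [(1, 30, 0), (2, 25, 1), (3, 41, 0)]
```

Each feature row keeps its original columns and gains a label of 1 for postseason games and 0 otherwise.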

## Modeling with scikit-learn 
We can re-use the logistic regression code from above to build a model that predicts whether an NHL game is a regular or postseason game. The updated code snippet to build a logistic regression model with scikit-learn is shown below. We drop the game_id column before fitting the model to avoid training the model on this attribute, which typically results in overfitting.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# create inputs for sklearn
y = features['label']
X = features.drop(['label', 'game_id'], axis=1).fillna(0)

# train a classifier 
lr = LogisticRegression()
model = lr.fit(X, y)

# Results
print("Accuracy: " + str(model.score(X, y)))
print("ROC: " + str(roc_auc_score(y, model.predict_proba(X)[:, 1])))

The result of this model was an accuracy of 94.7% and an ROC measure of 0.923. While we likely could have created a better performing model by manually specifying how to aggregate play events into a game summary, we were able to build a model with good accuracy while automating much of this process.

# Conclusion : Data science models, tools and environments
Conclusion to the first chapter.

Building data products is becoming an essential competency for applied data scientists. The Python ecosystem provides useful tools for taking prototype models and scaling them up to production-quality systems.

In this chapter, we laid the groundwork for the rest of this course by introducing the following:

- The datasets
- Coding tools
- Cloud environments
- Predictive models

We will use these to build scalable model pipelines. We also explored a recent Python library called FeatureTools, which enables automating many of the feature engineering steps in a model pipeline.

In our current setup, we built a simple batch model on a single machine in the cloud.