# AWS MACHINE LEARNING CERTIFICATION

## 1. DATA ENGINEERING

**AMAZON S3**
* **What is S3?**
  * Store Objects (Files) in Buckets (Directories)
  * Buckets must have a globally unique name
  * Max object size is 5 TB
  * Object Tags (key/value pair - up to 10) - useful for security/lifecycle
* **Amazon S3 for Machine Learning**
  * Backbone for many AWS ML Services 
  * Create a Data Lake
    * Infinite size
    * 99.999999999% durability
    * Decoupling of storage (S3) from compute (EC2, Athena, Rekognition, Glue)
  * Centralized Architecture
  * Object Storage: supports any file format
  * Common formats for ML: CSV, JSON, Parquet, ORC, Avro, Protobuf
* **Amazon S3 for Data Partitioning**
  * Pattern for speeding up range queries (AWS Athena)
  * By Date: s3://bucket/my-data/year/month/day/hour/data_00.csv
  * By Product: s3://bucket/my-data/product-id/data_32.csv
  * You can define whatever partitioning strategy you like
  * Data partitioning will be handled by some tools we use (AWS Glue)
* **Storage Classes**

Can move between classes manually or using S3 Lifecycle configurations
  * *Standard - General Purpose*
    * 99.99% availability
    * Used for frequently accessed data
    * Low latency and high throughput
    * Sustain 2 concurrent facility failures 
    * Use cases: Big data analytics, mobile & gaming applications 
  * *Standard - Infrequent Access (IA)*
    * For data that is less frequently accessed but requires rapid access when needed
    * Lower cost than S3 Standard
    * 99.99% availability
    * Use cases: Disaster Recovery, Backups
  * *One Zone Infrequent Access*
    * For data that is less frequently accessed but requires rapid access when needed
    * Lower cost than S3 Standard
    * High durability in a single AZ but data lost when AZ is destroyed
    * Use cases: Storing secondary backup copies of on-premises data or data you can recreate
  * *Glacier Instant Retrieval*
    * Low-cost object storage meant for archiving/backup
    * Pricing: price for storage + object retrieval cost
    * Millisecond retrieval, great for data accessed once a quarter
    * Minimum storage duration of 90 days
  * *Glacier Flexible Retrieval*
    * Low-cost object storage meant for archiving/backup
    * Pricing: price for storage + object retrieval cost
    * Retrieval options: Expedited (1-5 minutes), Standard (3-5 hours), Bulk (5-12 hours, free)
    * Minimum storage duration of 90 days
  * *Glacier Deep Archive*
    * Low-cost object storage meant for archiving/backup
    * Pricing: price for storage + object retrieval cost
    * Standard (12 hours), Bulk (48 hours)
    * Minimum storage duration of 180 days 
  * *Intelligent Tiering*
    * Small monthly monitoring and auto-tiering fee
    * Moves objects automatically between Access Tiers based on usage
    * There are no retrieval charges
    * Strategy:
      * Frequent Access tier (automatic): default tier
      * Infrequent Access tier (automatic): objects not accessed for 30 days
      * Archive Instant Access tier (automatic): objects not accessed for 90 days
      * Archive Access tier (optional): configurable from 90 to 700+ days
      * Deep Archive Access tier (optional): config. from 180 to 700+ days 
* **Durability vs Availability**
  * *Durability*
    * It represents how often an object is expected to be lost in S3
    * High durability (99.999999999%) of objects across multiple AZ
    * Same for all storage classes
  * *Availability*   
    * It measures how readily available a service is
    * Varies depending on storage class
    * E.g. S3 standard has 99.99% availability = not available 53 minutes a year
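
The 53-minute figure above follows from simple arithmetic on the availability percentage. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Expected unavailable minutes per year for a given availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * MINUTES_PER_YEAR


# 99.99% is the S3 Standard figure quoted above; the other values are for comparison only.
for pct in (99.99, 99.9, 99.5):
    print(f"{pct}% available -> {downtime_minutes_per_year(pct):.1f} minutes/year unavailable")
```

For 99.99% this gives roughly 52.6 minutes, which the notes round up to 53.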

* **Lifecycle Rules**
  * You can transition objects between storage classes
  * For infrequently accessed objects, move them to Standard IA
  * For archive objects that you don't need fast access to, move them to Glacier or Glacier Deep Archive
  * Moving objects can be automated using Lifecycle Rules
  * *Transition Actions*
    * Configure objects to transition to another storage class
    * Move objects to Standard IA class 60 days after creation
    * Move to Glacier for archiving after 6 months
  * *Expiration Actions*
    * Configure objects to expire (delete) after some time
    * Access log files can be set to delete after 365 days
    * Can be used to delete old version of files (if versioning enabled)
    * Can be used to delete incomplete Multi-Part uploads

Rules can be created for a certain prefix (e.g. s3://mybucket/mp3/*) or for certain object tags (e.g. Department: Finance)
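
The transition and expiration actions above can be combined in a single rule. The sketch below builds the rule as a plain Python dictionary in the shape that boto3's `put_bucket_lifecycle_configuration` accepts for its `LifecycleConfiguration` argument; no AWS call is made. The rule ID, the `mp3/` prefix and the 7-day multipart cleanup window are assumptions chosen for illustration.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-mp3-files",      # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": "mp3/"},   # rule applies only to keys under this prefix
            "Transitions": [
                {"Days": 60, "StorageClass": "STANDARD_IA"},  # Standard IA after 60 days
                {"Days": 180, "StorageClass": "GLACIER"},     # archive after about 6 months
            ],
            "Expiration": {"Days": 365},    # delete after one year
            # Assumed 7-day window for cleaning up incomplete multipart uploads
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

A tag-based rule would swap the filter for something like `{"Tag": {"Key": "Department", "Value": "Finance"}}`.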

* **S3 Analytics - Storage Class Analysis**
  * Help you decide when to transition objects to the right storage class
  * Recommendations for Standard and Standard IA (does not work for One-Zone IA or Glacier)
  * Report is updated daily
  * 24 to 48 hours to start seeing analysis

<br>

If objects have a **defined lifecycle**, such as needing frequent access for a month and then never needing to be accessed again, a lifecycle policy is the most efficient and cost-effective solution for you.

Else if access patterns are unpredictable or difficult to identify, **Intelligent Tiering** is an excellent solution for hands-off storage access monitoring while still taking advantage of cost savings from S3-IA.  

* **S3 Security - Encryption**
  * S3 Encryption for Object (4 methods)
    * *SSE-S3*: encrypts S3 objects using keys handled by AWS
    * *SSE-KMS*: Use AWS Key Management Service to manage encryption keys
      * Additional security (you must have access to KMS key)
      * Audit trail for KMS key usage
    * *SSE-C*: when you want to manage your own encryption keys
    * *Client Side Encryption*
  * From an ML perspective, *SSE-S3* and *SSE-KMS* will be most likely used
  * S3 Security:
    * *User Based*:
      * IAM policies - which API calls should be allowed for a specific user
    * *Resource Based*:
      * Bucket Policies - bucket wide rules from the S3 console - allow cross account
      * Object Access Control List (ACL) - finer grain
      * Bucket Access Control List (ACL) - less common
  * S3 Bucket Policies
    * JSON based policies
      * Resources: buckets and objects
      * Actions: Set of API to Allow or Deny
      * Effect: Allow/Deny
      * Principal: the account or user to apply the policy to
    * Use an S3 bucket policy to:
      * Grant public access to the bucket
      * Force objects to be encrypted at upload   
      * Grant access to another account (Cross Account)
  * The old way to enable default encryption was to use a bucket policy; the new way is to use the default encryption option in S3
  * Other security types:
    * Networking - **VPC Endpoint Gateway**
      * Allow traffic to stay within your VPC (instead of public web)
      * Make sure your private services (AWS SageMaker) can access S3
      * Very important for AWS ML Exam
    * Logging and Audit
      * S3 access logs can be stored in other S3 bucket
      * API calls can be logged in AWS CloudTrail
    * Tag Based (combined IAM policies and bucket policies)
      * Example: Add tag Classification = PHI to your object   
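
As an illustration of the bucket policy elements listed above (Resource, Action, Effect, Principal), here is a sketch of a policy that forces objects to be encrypted at upload by denying any `s3:PutObject` request that lacks the server-side encryption header. The bucket name and the helper function are hypothetical; the policy is only built and printed, not applied.

```python
import json


def deny_unencrypted_uploads_policy(bucket: str) -> dict:
    """Build a bucket policy that rejects uploads without an SSE header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",
                "Effect": "Deny",
                "Principal": "*",                       # applies to every caller
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",  # all objects in the bucket
                # "Null": "true" matches requests where the header is absent
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            }
        ],
    }


policy = deny_unencrypted_uploads_policy("my-example-bucket")  # hypothetical bucket
print(json.dumps(policy, indent=2))
```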

<br>

**KINESIS**
* *Kinesis* is a managed alternative to Apache Kafka
* Great for app logs, metrics, IoT, **real time** big data and stream processing frameworks (Spark, NiFi)
* Data is automatically replicated synchronously to 3 AZ
* **Kinesis Stream**: low latency streaming ingest at scale
  * Streams are divided into ordered Shards
  * Data retention is 24h by default, can go up to 365 days
  * Multiple apps can consume the same streams
  * Once data is inserted in Kinesis, it can't be deleted
  * Records can be up to 1MB in size
  * Capacity Modes:
    * Provisioned Mode:
      * You choose the number of shards provisioned, scale manually or using API
      * Each shard gets 1 MB/s in (or 1000 records per second)
      * Each shard gets 2 MB/s out
      * You pay per shard provisioned per hour
    * On Demand Mode:
      * No need to provision or manage the capacity
      * Default capacity provisioned (4MB per second)
      * Scales automatically based on observed throughput peak during the last 30 days
      * Pay per stream per hour & data in/out per GB
  * Limits:
    * Producer:
      * 1 MB/s or 1000 messages/s at write per shard
      * "ProvisionedThroughputExceededException" otherwise
    * Consumer:
      * 2 MB/s at read per shard across all consumers
      * 5 API calls per second per shard across all consumers
    * Data Retention:
      * 24 hours data retention by default
      * Can be extended to 365 days
  * RECAP:
    * Going to write custom code (producer/consumer)
    * Real Time (200ms latency for classic, 70ms latency for enhanced fan-out)
    * Automatic scaling with On-Demand Mode
    * Data Storage for 1 to 365 days, replay capability, multi consumers     
* **Kinesis Firehose**: load streams into S3, Redshift, ElasticSearch & Splunk
  * Fully managed service, no administration
  * Near real time (60 sec latency minimum for non full batches)
  * Automatic scaling
  * Supports many data formats
  * Data conversion from CSV/JSON to Parquet/ORC (only for S3)
  * Data transformation through AWS Lambda
  * Supports compression when target is S3 (ZIP, SNAPPY)
  * Pay for the amount of data going through Firehose  
  * RECAP:
    * Fully managed, send to S3, Splunk, Redshift, ElasticSearch
    * Serverless data transformation with Lambda
    * Near real time (lowest buffer time is 1 minute)
    * Automated scaling 
    * No data storage            
* **Kinesis Analytics**: perform real-time analytics on streams using SQL
  * Use cases:
    * *Streaming ETL*: select columns, make simple transformation on streaming data
    * *Continuous metric generation*: live leaderboard for a mobile game
    * *Responsive analytics*: look for certain criteria and build alerting
  * Features:
    * Pay only for resources consumed (not cheap)
    * Serverless, scales automatically
    * Use IAM permission to access streaming source and destination
    * SQL or Flink to write the computation
    * Schema discovery
    * Lambda for preprocessing
  * ML on Kinesis Data Analytics:
    * *RANDOM_CUT_FOREST*
      * SQL function for **anomaly detection** on numeric columns in a stream
      * Use **recent history** to compute model
    * *HOTSPOTS*
      * Locate and return info about relatively dense regions in your data      
* **Kinesis Video Streams**: meant for streaming video in real-time
  * Producers:
    * Security camera, body worn camera, AWS DeepLens, smartphone camera, radar data
    * One producer per video stream
  * Video playback capability
  * Consumers:
    * Build your own (MXNet, Tensorflow)
    * AWS SageMaker
    * AWS Rekognition Video
  * Keep data for 1 hour to 10 years

<br>

* **Kinesis Summary - ML**
  * *Kinesis Data Stream*: Create real-time ML applications
  * *Kinesis Data Firehose*: Ingest massive data in near real time, e.g. deliver it to S3 and then use it to train ML models
  * *Kinesis Data Analytics*: Real-time ETL or ML algorithms (such as RANDOM_CUT_FOREST) on streaming data
  * *Kinesis Video Streams*: Real-time video stream to create ML applications (face detection)
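
The Provisioned Mode limits for Kinesis Data Streams (1 MB/s or 1000 records/s in, 2 MB/s out per shard) translate into a simple sizing rule: provision enough shards to satisfy whichever limit is hit first. A rough sketch, assuming shared (classic) consumers and steady traffic; the function name and the sample workload are made up for illustration.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # per-shard ingest limit in MB/s
WRITE_RECORDS_PER_SHARD = 1000  # per-shard ingest limit in records/s
READ_MB_PER_SHARD = 2.0         # per-shard read limit in MB/s, shared by all consumers


def shards_needed(write_mb_per_sec: float, records_per_sec: float, read_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical workload: 3.5 MB/s in, 2500 records/s, 9 MB/s total read across consumers
print(shards_needed(3.5, 2500, 9.0))
```

In this example the read side is the bottleneck, so the total consumer throughput decides the shard count.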

<br>

**AWS GLUE**
* **Glue Data Catalog**
  * Metadata repository for all your tables
    * Automated schema inference
    * Schemas are versioned
  * Integrates with Athena or Redshift Spectrum (schema & data discovery) 
* **Glue Crawlers**: can help build the Glue Data Catalog
  * Crawlers go through your data to infer schemas and partitions 
  * Works with JSON, Parquet, CSV, relational stores
  * Crawlers work for S3, Redshift, RDS
  * Run the crawler on a schedule or on demand
  * Need an IAM role/credentials to access the data stores 
* **Glue & S3 Partitions**:
  * Glue crawler will extract partitions based on how your S3 data is organized
  * Think up front about how you will be querying your data lake in S3
  * Do you query primarily by *time ranges* ?
    * Organize your buckets as s3://my-bucket/dataset/yyyy/mm/dd/device
  * Do you query primarily by *device* ?
    * Organize your buckets as s3://my-bucket/dataset/device/yyyy/mm/dd 
* **Glue ETL**
    * Transform data, clean data, enrich data (before analysis)
      * Generate ETL code in Python or Scala, you can modify the code
      * Can provide your own Spark or PySpark scripts
      * Target can be S3, JDBC (RDS, Redshift) or Glue Data Catalog
    * Fully managed, cost effective, pay only for resource consumed
    * Jobs are run on a serverless Spark platform
    * *Glue Scheduler*: schedule the jobs
    * *Glue Triggers*: automate job runs based on events
    * **Transformation**
      * *Bundled Transformation*
        * DropFields, DropNullFields - remove null fields
        * Filter - specify a function to filter records
        * Join - to enrich data
        * Map - add fields, delete fields
      * *ML Transformation*
        * **FindMatches ML**: identify duplicate or matching records in your dataset, even when the records don't have a common unique identifier and no field match exactly
      * *Format Conversion*: CSV, JSON, Avro, Parquet
      * *Apache Spark Transformation*: e.g. K-Means
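
The two layouts described under *Glue & S3 Partitions* differ only in the order of the path components. A small sketch that builds both key styles; the helper names, dataset name, device ID and file name are hypothetical.

```python
from datetime import datetime, timezone


def key_by_time(dataset: str, device: str, ts: datetime, filename: str) -> str:
    """Layout for queries that filter mostly by time range: dataset/yyyy/mm/dd/device."""
    return f"{dataset}/{ts:%Y/%m/%d}/{device}/{filename}"


def key_by_device(dataset: str, device: str, ts: datetime, filename: str) -> str:
    """Layout for queries that filter mostly by device: dataset/device/yyyy/mm/dd."""
    return f"{dataset}/{device}/{ts:%Y/%m/%d}/{filename}"


ts = datetime(2024, 3, 7, 14, 30, tzinfo=timezone.utc)
print(key_by_time("dataset", "sensor-42", ts, "data_00.csv"))
print(key_by_device("dataset", "sensor-42", ts, "data_00.csv"))
```

Whichever component comes first in the key is the one a crawler-discovered partition scheme can prune most cheaply, which is why the query pattern should drive the layout.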

<br>

**AWS DATA STORES for ML**
* *RedShift*
  * Data Warehousing, SQL analytics (OLAP - Online Analytical Processing)
  * OLAP -> columnar based (data is organized in columns)
  * Load data from S3 to Redshift
  * Use Redshift Spectrum to query data directly in S3 (no loading)
* *RDS, Aurora*
  * Relational Store, SQL (OLTP - Online Transaction Processing)
  * OLTP -> row based (data is organized in rows)
  * Must provision servers in advance
* *DynamoDB*
  * NoSQL data store, serverless, provision read/write capacity
  * Useful to store a machine learning model served by your application
* *S3*
  * Object storage
  * Serverless, infinite storage
  * Integration with most AWS Services
* *OpenSearch (ElasticSearch)*
  * Indexing of data
  * Search amongst data points
  * Clickstream analytics
* *ElastiCache*
  * Caching mechanism
  * Not really used for ML

<br>

**AWS DATA PIPELINE**
* Destinations include S3, RDS, DynamoDB, Redshift and EMR
* Manages task dependencies
* Retries and notifies on failures
* Data sources may be on-premises
* Highly available
* *Data Pipeline vs Glue*
  * **Glue**
    * Glue ETL - run Apache Spark code, Scala or Python based, focus on the ETL
    * Glue ETL - don't worry about configuring or managing the resources
    * Data Catalog to make the data available to Athena or Redshift Spectrum
  * **Data Pipeline**
    * Orchestration service
    * More control over the environment, compute resources that run code 
    * Allows access to EC2 or EMR instances (creates resources in your own account)

<br>

**AWS BATCH**
* Run batch jobs as Docker Images 
* Dynamic provisioning of the instances (EC2 & Spot Instances)
* Optimal quantity and type based on volume and requirements 
* No need to manage clusters, fully serverless
* You just pay for the underlying EC2 instances
* Schedule Batch jobs using CloudWatch Events
* Orchestrate Batch Jobs using AWS Step Functions
* *Batch vs Glue*
  * **Glue**
    * Glue ETL - run Apache Spark code, Scala or Python based, focus on the ETL
    * Glue ETL - don't worry about configuring or managing the resources
    * Data Catalog to make the data available to Athena or Redshift Spectrum
  * **Batch**
    * For any computing job regardless of the job (must provide Docker image)
    * Resources are created in your account, managed by Batch
    * For any non-ETL related work, Batch is probably better

<br>

**DMS - DATABASE MIGRATION SERVICE**
* Quickly and securely migrate databases to AWS, resilient, self healing
* The source db remains available during the migration
* Supports:
  * Homogeneous migrations: ex Oracle to Oracle
  * Heterogeneous migrations: ex Microsoft SQL to Aurora
* Continuous Data Replication using CDC
* You must create an EC2 instance to perform the replication tasks
* *AWS DMS vs Glue*
  * **Glue**
    * Glue ETL - run Apache Spark code, Scala or Python based, focus on the ETL
    * Glue ETL - don't worry about configuring or managing the resources
    * Data Catalog to make the data available to Athena or Redshift Spectrum
  * **AWS DMS**
    * Continuous Data Replication
    * No data transformation
    * Once the data is in AWS, you can use Glue to transform it

<br>

**AWS STEP FUNCTION**
* Use to orchestrate and design workflows
* Easy visualizations
* Advanced Error Handling and Retry mechanism outside the code
* Audit of the history of workflows
* Ability to wait for an arbitrary amount of time
* Max execution time of a State Machine is 1 year
* *!! EXAM TIP !!*: Any time you need to orchestrate a workflow, or make sure one step happens and then another, Step Functions is the perfect candidate
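
To make the retry, error handling and wait features concrete, here is a sketch of a small state machine written in Amazon States Language and held as a Python dictionary. The state names and Lambda ARNs are placeholders, and the helper at the end is only a local sanity check, not part of any AWS SDK.

```python
import json

state_machine = {
    "Comment": "Hypothetical two-step ML workflow",
    "StartAt": "PrepareData",
    "States": {
        "PrepareData": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:prepare-data",  # placeholder
            # Retry lives in the workflow definition, outside the task's own code
            "Retry": [
                {"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 5,
                 "MaxAttempts": 3, "BackoffRate": 2.0}
            ],
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
            "Next": "WaitBeforeTraining",
        },
        "WaitBeforeTraining": {"Type": "Wait", "Seconds": 60, "Next": "TrainModel"},
        "TrainModel": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:train-model",  # placeholder
            "End": True,
        },
        "NotifyFailure": {"Type": "Fail", "Error": "PreparationFailed",
                          "Cause": "Data preparation did not succeed after retries"},
    },
}


def undefined_targets(definition: dict) -> set:
    """Return state names that are referenced by StartAt/Next/Catch but never defined."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return targets - set(states)


print(json.dumps(state_machine, indent=2))
print("undefined targets:", undefined_targets(state_machine))
```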

<br>

**AWS DATASYNC**
* For data migrations from on-premises to AWS storage services
* A DataSync Agent is deployed as a VM and connects to your internal storage 
  * NFS, SMB, HDFS
* Encryption and data validation

<br>

**MQTT**
* An Internet of Things (IoT) thing
* Standard messaging protocol
* Think of it as how lots of sensor data might get transferred to your ML model
* The AWS IoT Device SDK can connect via MQTT







## 2. EXPLORATORY DATA ANALYSIS

**TYPES OF DATA**
* **Numerical**
  * Represent some sort of quantitative measurement (heights of people, page load times, stock prices, etc)
  * *Discrete data*: Integer based
  * *Continuous data*: Infinite number of possible values
* **Categorical**
  * Qualitative data that has no inherent mathematical meaning (gender, yes/no, state of residence, etc)
  * You can assign numbers to categories in order to represent them more compactly, but the numbers don't have mathematical meaning
* **Ordinal**
  * A mixture of numerical and categorical
  * Categorical data that has mathematical meaning 
  * Example: movie ratings on a 1-5 scale

<br>

**DATA DISTRIBUTION**
* **Normal/Gaussian Distribution**
  * Gives you the probability of a data point falling within some given range of a given value
* **Probability Mass Function**
  * The way that we visualize the probability of discrete data occurring
* **Poisson Distribution**
  * Example of probability mass function
  * It deals with discrete data
* **Binomial Distribution**
  * Describes the number of successes in a sequence of independent experiments, each with a yes/no outcome
* **Bernoulli Distribution**
  * Special case of Binomial Distribution
  * Has a single trial (n=1)
  * Can think of a binomial distribution as the sum of independent Bernoulli trials
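
The discrete distributions above can be written out directly as probability mass functions using only the standard library. This sketch uses the textbook formulas; the parameter values in the demo are arbitrary.

```python
from math import comb, exp, factorial


def bernoulli_pmf(k: int, p: float) -> float:
    """P(X = k) for a single yes/no trial, k in {0, 1}."""
    return p if k == 1 else 1 - p


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """P(exactly k events) when events occur at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)


# With n = 1 the binomial reduces to the Bernoulli distribution
print(binomial_pmf(1, 1, 0.3), bernoulli_pmf(1, 0.3))

# A PMF sums to 1 over all possible outcomes
print(sum(binomial_pmf(k, 10, 0.3) for k in range(11)))
```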

<br>

**TIME SERIES ANALYSIS**
* **Trends**
  * When a time series seems to be trending in one direction
* **Seasonality**
  * The time series tends to peak during certain intervals in a cyclical way
* **Noise**
  * Some variations are just random in nature
  * Seasonality + Trends + Noise = Time Series
    * Additive model
    * Seasonal variation is constant
  * Sometimes Seasonality \* Trends \* Noise = Time Series
    * Multiplicative model: trends sometimes amplify seasonality and noise
    * Seasonal variation increases as the trend increases
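
The difference between the additive and multiplicative views shows up in how the seasonal swing behaves as the trend grows. The sketch below builds two synthetic series and compares the swing in the first and fourth year. The linear trend, the 12-step period and the amplitudes are assumptions, and noise is left out so the result is deterministic.

```python
import math

PERIOD = 12  # assumed: monthly data with a yearly cycle


def trend(t: int) -> float:
    return 100 + 2 * t  # assumed linear upward trend


def seasonal(t: int) -> float:
    return math.sin(2 * math.pi * t / PERIOD)


def additive(t: int) -> float:
    # Seasonal term is added: its size does not depend on the trend level
    return trend(t) + 10 * seasonal(t)


def multiplicative(t: int) -> float:
    # Seasonal term scales the trend: its size grows with the trend level
    return trend(t) * (1 + 0.2 * seasonal(t))


def seasonal_swing(model, start: int) -> float:
    """Peak-to-trough size of the detrended series over one period."""
    detrended = [model(t) - trend(t) for t in range(start, start + PERIOD)]
    return max(detrended) - min(detrended)


for name, model in (("additive", additive), ("multiplicative", multiplicative)):
    first, fourth = seasonal_swing(model, 0), seasonal_swing(model, 36)
    print(f"{name}: first-year swing {first:.1f}, fourth-year swing {fourth:.1f}")
```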

<br>

**AWS ATHENA**
* **What is?**
  * Interactive query service for S3 (SQL)
  * No need to load data, it stays in S3
  * Presto under the hood
  * Serverless
  * Supports many data formats (CSV, JSON, ORC, Parquet, Avro)
  * Unstructured, semi-structured or structured
* **Examples**
  * Ad-hoc queries of web logs
  * Querying staging data before loading to Redshift
  * Analyze CloudTrail/CloudFront/VPC/ELB logs in S3
  * Integration with Jupyter, Zeppelin, RStudio notebooks
  * Integration with QuickSight
  * Integration via ODBC/JDBC with other visualization tools
* **Cost Model**
  * Pay as you go
    * $5 per TB scanned
    * Successful or cancelled queries count, failed queries don't
    * No charge for DDL (CREATE, ALTER, DROP, etc)
  * Save lots of money by using columnar formats
    * ORC, Parquet
    * Save 30%-90% and get better performance
  * Glue and S3 have their own charges
* **Security**
  * Access control
    * IAM, ACLs, S3 bucket policies
    * AmazonAthenaFullAccess / AWSQuickSightAthenaAccess
  * Encrypt results at rest in S3 staging directory
    * Server-side encryption with S3 managed key (SSE-S3)
    * Server-side encryption with KMS key (SSE-KMS)     
    * Client-side encryption with KMS key (CSE-KMS)  
  * Cross-account access in S3 bucket policy possible
  * Transport Layer Security (TLS) encrypts in-transit (between Athena and S3) 
* **Anti Patterns**
  * Highly formatted reports/visualization
    * That's what QuickSight is for
  * ETL
    * Use Glue instead
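
The Athena cost model above reduces to simple arithmetic on the amount of data scanned. A sketch using the $5 per TB rate and the 30%-90% columnar savings quoted in these notes; the 2 TB query size is an arbitrary example.

```python
PRICE_PER_TB_USD = 5.0  # rate quoted in the notes above


def athena_query_cost(tb_scanned: float) -> float:
    """Cost in USD of a query that scans the given number of terabytes."""
    return tb_scanned * PRICE_PER_TB_USD


csv_tb_scanned = 2.0  # hypothetical query over row-based CSV data
csv_cost = athena_query_cost(csv_tb_scanned)
print(f"CSV scan: ${csv_cost:.2f}")

# Columnar formats (ORC, Parquet) let Athena read only the columns it needs
for saving in (0.30, 0.90):
    columnar_cost = athena_query_cost(csv_tb_scanned * (1 - saving))
    print(f"Columnar with {saving:.0%} less data scanned: ${columnar_cost:.2f}")
```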

<br>

**AWS QUICKSIGHT**
* **What is QuickSight?**
  * Business analytics and visualizations in the cloud
  * Fast, easy, cloud-powered business analytics service
  * Allows all employees in an organization to:
    * Build visualization
    * Perform ad-hoc analysis
    * Quickly get business insights from data
    * Anytime, on any device (browsers, mobile)
  * Serverless
* **Quicksight Data Sources**
  * Redshift
  * Aurora/RDS
  * Athena
  * EC2-hosted databases
  * Files (S3 or on-premises)
    * Excel
    * CSV, TSV
    * Common or extended log format 
  * Data preparation allows limited ETL
* **Spice**
  * Data sets are imported into Spice
    * Super fast, parallel, in-memory calculation engine
    * Uses columnar storage, in-memory, machine code generation
    * Accelerates interactive queries on large datasets
  * Each user gets 10GB of Spice
  * Highly available / durable
  * Scales to hundreds of thousands of users
* **QuickSight Use Cases**
  * Interactive ad-hoc exploration / visualization of data
  * Dashboard and KPI's
  * Analyze / visualize data from:
    * Logs in S3
    * On-premises databases
    * AWS (RDS, Redshift, Athena, S3)
    * SaaS applications such as Salesforce
    * Any JDBC/ODBC data source
* **ML Insights**
  * Anomaly Detection
  * Forecasting
  * Auto-narratives
* **QuickSight Anti-Patterns**
  * Highly formatted canned reports
    * QuickSight is for ad-hoc queries, analysis and visualization
  * ETL
    * Use Glue instead, although QuickSight can do some transformations
* **QuickSight Security**
  * Multi factor authentication on your account
  * VPC connectivity
    * Add QuickSight's IP address range to your database security groups
  * Row level security
  * Private VPC access
    * Elastic Network Interface
    * AWS Direct Connect  
* **QuickSight User Management**  
  * Users defined via IAM or email signup
  * Active Directory integration with QuickSight Enterprise Edition
* **QuickSight Pricing** 
  * Annual Subscription
    * Standard: $9/user/month   
    * Enterprise: $18/user/month
  * Extra SPICE capacity (beyond 10GB)
    * $0.25 (standard) - $0.38 (enterprise)/GB/month 
  * Month to month
    * Standard: $12/user/month
    * Enterprise: $24/user/month
  * Enterprise edition
    * Encryption at rest
    * Microsoft Active Directory integration
* **QuickSight Visualization**
  * Visual Types:
    * *AutoGraph*: It automatically selects the most appropriate visualization based on the properties of the data itself
    * *Bar Charts*: For comparison and distribution (histograms)
    * *Line Graphs*: For changes over time
    * *Scatter Plots, Heat Maps*: For correlation
    * *Pie Graphs, Tree Maps*: For aggregation
    * *Pivot Tables*: For tabular data, excel format

<br>

**EMR (Elastic MapReduce)**

EMR distributes the work of processing data across an entire cluster of computers. Massive datasets often need a cluster to process and prepare the data in parallel for your ML training jobs.
* **What is EMR?**
  * Managed Hadoop framework on EC2 instances
  * Includes Spark, HBase, Presto, Flink, Hive and more
  * EMR notebooks
  * Several integration points with AWS
* **EMR Cluster**
  * *Master Node*:
    * Manages the cluster
    * Single EC2 instance
  * *Core Node*:
    * Hosts HDFS data and runs tasks
    * Can be scaled up & down but with some risk
  * *Task Node*:
    * Runs tasks, doesn't host data
    * No risk of data loss when removing
    * Good use of spot instances
* **EMR Usage**
  * Transient vs Long-Running Clusters
    * *Transient*:
      * It's configured to be automatically terminated once all the steps you've defined have been completed (e.g. loading data, processing data, storing output)
      * If you have a predefined sequence that you want your cluster to do
    * *Long-Running*:
      * You want to interact with the application directly and then just manually terminate it
      * Appropriate for experimenting with datasets  
    * Can spin up task nodes using Spot Instances for temporary capacity
    * Can use reserved instances on long-running clusters to save $
  * Connect directly to master to run jobs 
  * Submit ordered steps via the console
  * EMR Serverless lets AWS scale your nodes automatically 
* **EMR / AWS Integration**
  * EC2 for the instances that comprise the nodes in the cluster
  * VPC to configure the virtual network in which you launch your instances
  * S3 to store input and output data
  * CloudWatch to monitor cluster performance and configure alarms
  * IAM to configure permissions
  * CloudTrail to audit requests made to the service
  * Data Pipeline to schedule and start your cluster
* **EMR Storage**
  * HDFS
  * EMRFS
    * Access S3 as if it were HDFS
    * EMRFS Consistent View - optional for S3 consistency
  * Local file system
  * EBS for HDFS
* **EMR Promises**
  * EMR charges by the hour + EC2 charges
  * Provisions new nodes if a core node fails
  * Can add and remove tasks nodes on the fly
  * Can resize a running cluster's core nodes
* **Hadoop** 
  * It's composed of:
    * *HDFS*:
      * It's the base, Hadoop File System, a distributed file system for Hadoop 
      * It distributes the data and stores across the instances in the cluster
      * Multiple copies of the data are stored on different instances to ensure no data is lost if an individual instance fails
    * *YARN*:
      * It's on top of HDFS
      * YARN (Yet Another Resource Negotiator)
      * It centrally manages cluster resources from multiple data processing frameworks
    * *MapReduce*:
      * It's a software framework for easily writing applications that process vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner
* **Spark**
  * It's an open source distributed processing system commonly used for big data workloads, and it's generally faster than MapReduce
  * It uses in-memory caching and optimized query execution for fast analytic queries against data of any size
  * It has API for Python, Java, Scala and R
  * Use cases: 
    * Stream Processing
    * Interactive SQL
    * ML
  * Spark components:
    * *Spark Core*: the foundation of the platform, it's responsible for things like memory management, fault recovery, scheduling, distributing and interacting with storage systems
    * *Spark SQL*: Distributed query engine that provides low latency interactive queries up to 100 times faster than MapReduce. Columnar storage and code generation for fast queries and it supports various data sources (JDBC, ODBC, JSON, HDFS, Hive, ORC and Parquet). It exposes dataframes in Python
    * *Spark Streaming*: streaming analytics, data gets ingested in mini batches and it can be integrated with AWS Kinesis
    * *MLLib*: ML library for Spark
      * Classification: Logistic Regression, Naive Bayes
      * Regression
      * Decision Tree
      * Recommendation engine (ALS)
      * Clustering (K-Means)
      * LDA (topic modeling)
      * ML workflow utilities (pipelines, feature transformation, persistence)
      * SVD, PCA, statistics
    * *GraphX*: distributed graph processing framework 
  * Zeppelin + Spark
    * Can run Spark code interactively (like in Spark shell)
      * This speeds up your development cycle
      * Allows easy experimentation and exploration of your big data
    * Can execute SQL queries directly against SparkSQL
    * Query results may be visualized in charts and graphs
    * Makes Spark feel more like a data science tool
* **EMR Notebook**
  * Similar concept to Zeppelin, with more AWS integration
  * Notebooks backed up to S3
  * Provision clusters from the notebook
  * Hosted inside a VPC
  * Accessed only via AWS console
* **EMR Security**
  * IAM policies
  * Kerberos
  * SSH
  * IAM roles
* **EMR Instance Types**
  * *Master node*:
    * m4.large if < 50 nodes  
    * m4.xlarge if > 50 nodes
  * *Core & task nodes*:
    * m4.large is usually good
    * If cluster waits a lot in external dependencies: t2.medium
    * Improved performance: m4.xlarge
    * Computation-intensive applications: high CPU instances
    * Database, memory-caching applications: high memory instances
    * Network/CPU intensive (NLP, ML): cluster compute instances
  * *Spot instances*:
    * Good choice for task nodes
    * Only use on core & master if you're testing or very cost-sensitive; you're risking partial data loss

<br>

**FEATURE ENGINEERING**
* **What is Feature Engineering?**
  * Applying your knowledge of the data (and the model you are using) to create better features to train your model with
    * Which features should I use?
    * Do I need to transform these features in some way?
    * How do I handle missing data?
    * Should I create new features from the existing ones?
  * You can't just throw in raw data and expect good results
  * This is the art of ML, where expertise is applied
  * "Applied ML is basically feature engineering" - Andrew Ng
* **The Curse of Dimensionality**
  * Too many features can be a problem - leads to sparse data
  * As we keep adding more and more dimensions, the available space that we have to work with just keeps exploding
  * The more features you have, the larger the space in which a solution has to be found
  * Every feature is a new dimension
  * Much of feature engineering is selecting the features most relevant to the problem at hand
    * This often is where domain knowledge comes into play
  * Unsupervised dimensionality reduction techniques can also be employed to distill many features into fewer features
    * PCA
    * K-Means
* **Imputing Missing Data**
  * **Mean Replacement**
    * Replace missing values with the mean value from the rest of the column
    * Fast & easy, won't affect mean or sample size of overall data set
    * Median may be a better choice than mean when outliers are present
    * But it's generally pretty terrible:
      * Only works on column level, misses correlations between features
      * Can't use on categorical features
      * Not very accurate
  * **Dropping**
    * If not many rows contain missing data
      * and dropping those rows doesn't bias your data ...
      * and you don't have a lot of time ...
      * maybe it's a reasonable thing to do
    * But it's never going to be the best approach
    * Almost anything is better
  * **Machine Learning**
    * **KNN**: Find K nearest (most similar) rows and average their values
      * Assumes numerical data, not categorical
      * There are ways to handle categorical data, but categorical data is probably better served by DL
    * **Deep Learning**
      * Build a ML model to impute data for your ML model
      * Works well for categorical data. Really well. But it's complicated
    * **Regression**
      * Find linear or non-linear relationship between the missing feature and other features
      * Most advanced technique: MICE (Multiple Imputation by Chained Equations)
  * **Just get more data**
    * What's better than imputing data? Getting more real data
    * Sometimes you just have to try harder or collect more data
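
Mean replacement, and the reason the median can be a better choice when outliers are present, fits in a few lines. A minimal sketch using the standard library; the income column is made-up data with one extreme value, and `None` marks a missing entry.

```python
from statistics import mean, median


def impute(column: list, strategy: str = "mean") -> list:
    """Replace None entries with the mean or median of the observed values."""
    observed = [value for value in column if value is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if value is None else value for value in column]


incomes = [30_000, 32_000, None, 35_000, 1_000_000]  # hypothetical data, one outlier

print(impute(incomes, "mean"))    # the fill value is pulled up by the outlier
print(impute(incomes, "median"))  # the fill value stays close to typical incomes
```

Both strategies look only at one column, so neither captures correlations between features, which is the main weakness noted above.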

<br>

**HANDLING UNBALANCED DATA**
* **What is Unbalanced Data?**
  * Large discrepancy between "positive" and "negative" cases:
    * Fraud detection: fraud is rare and most rows will be not-fraud
    * Don't let the terminology confuse you: positive doesn't mean "good":
      * It means the thing you're testing for is what happened
      * If your ML model is made to detect fraud, then fraud is the positive case
  * Mainly a problem with NN
* **Oversampling**
  * Duplicate samples from the minority class
  * Can be done at random
* **Undersampling** 
  * Instead of creating more positive samples, remove negative ones
  * Throwing data away is usually not the right answer
    * Unless you are specifically trying to avoid big data scaling issues
* **SMOTE**
  * Synthetic Minority Over-Sampling Technique
  * Artificially generate new samples of the minority class using nearest neighbors
    * Run K-nearest-neighbors of each sample of the minority class 
    * Create a new sample from the KNN result (mean of the neighbors)
  * Both generates new samples and undersamples majority class
  * Generally better than just oversampling
* **Adjusting Thresholds**
  * When making predictions about a classification (fraud/not fraud) you have some sort of threshold of probability at which point you'll flag something as the positive case (fraud)
  * If you have too many false positives, one way to fix that is to simply increase that threshold
    * Guaranteed to reduce false positives
    * But could result in more false negatives
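
The threshold trade-off described above can be sketched as follows. The labels and scores are invented for illustration (1 = fraud), and `confusion_counts` is a hypothetical helper, not part of any library.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are flagged as positive."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            counts["TP"] += 1
        elif predicted == 1 and label == 0:
            counts["FP"] += 1
        elif predicted == 0 and label == 0:
            counts["TN"] += 1
        else:
            counts["FN"] += 1
    return counts

# Hypothetical fraud labels and model scores (probability of fraud)
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.55, 0.60, 0.65, 0.70, 0.80, 0.95]

low = confusion_counts(y_true, scores, threshold=0.5)    # 3 FP, 0 FN
high = confusion_counts(y_true, scores, threshold=0.75)  # 0 FP, 1 FN
```

Raising the threshold removes the false positives in this toy data, at the price of missing one real fraud case.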

<br>

**HANDLING OUTLIERS**
* **Variance** 
  * Variance ($σ^2$) is simply the average of the squared differences from the mean
  * $σ^2 = \frac{\sum (x-μ)^2}{N}$
    * *x*: the example
    * *μ*: the average value
    * *N*: the total number of values
  * Square is applied for 2 reasons:
    * We want negative deviations from the mean to count just as much as positive ones
    * Also we want to give more weight to the outliers, so this amplifies the effect of the values very different from the mean
* **Standard Deviation**
  * Standard Deviation ($σ$) is just the square root of the Variance
  * This is usually used as a way to identify outliers
  * Data points that lie more than one standard deviation from the mean can be considered unusual
  * You can talk about how extreme a data point is by talking about "how many sigmas" away from the mean it is
* **Dealing with Outliers**
  * Sometimes it's appropriate to remove outliers from your training data
  * Do this responsibly, and understand why you are doing it
  * For example: in collaborative filtering, a single user who rates thousands of movies could have a big effect on everyone else's rating. That may not be desirable
  * Another example: in web log data, outliers may represent bots or other agents that should be discarded
  * But if someone really wants the mean income of US citizens for example, don't toss out billionaires just because you want to
  * Our old friend standard deviation provides a principled way to classify outliers
  * Find data points more than some multiple of a standard deviation in your training data
  * What multiple? You just have to use common sense
  * Remember AWS's Random Cut Forest algorithm creeps into many of its services - it's made for outlier detection
    * Found within QuickSight, Kinesis Analytics, SageMaker and more
* **Binning**
  * Bucket observations together based on ranges of values 
  * Example: estimated ages of people
    * Put all 20-somethings in one classification, 30-something in another, etc
  * Quantile binning categorizes data by their place in the data distribution
    * Ensures even sizes of bins
  * Transforms numeric data to ordinal data
  * Especially useful when there is uncertainty in the measurements
* **Transforming**
  * Applying some function to a feature to make it better suited for training
  * Feature data with an exponential trend may benefit from a logarithmic transform
* **Encoding**
  * Transforming data into some new representation required by the model
  * **One-hot encoding**
    * Create buckets for every category
    * The bucket for your category has a 1, all others have a 0
    * Very common in Deep Learning where categories are represented by individual output neurons
* **Scaling/Normalization**
  * Some models prefer feature data to be normally distributed around 0 (most NN)
  * Most models require feature data to at least be scaled to comparable values
    * Otherwise features with larger magnitudes will have more weight than they should
    * Example: modeling age and income as features - incomes will be much higher values than ages
  * scikit-learn has a preprocessing module that helps (MinMaxScaler, etc)
  * Remember to scale your results back up
* **Shuffling**
  * Many algorithms benefit from shuffling their training data
  * Otherwise they may learn from residual signals in the training data resulting from the order in which they were collected
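
As a rough illustration of the variance and standard deviation formulas in this section, and of flagging outliers by "how many sigmas" they sit from the mean, here is a small standard-library sketch. The `ratings_per_user` data and the choice of 2 sigmas are assumptions for the example only.

```python
from math import sqrt

def variance(values):
    """Population variance: average of squared differences from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return sqrt(variance(values))

def split_outliers(values, n_sigmas=2.0):
    """Separate points lying more than n_sigmas standard deviations from the mean."""
    mu = sum(values) / len(values)
    limit = n_sigmas * std_dev(values)
    inliers = [x for x in values if abs(x - mu) <= limit]
    outliers = [x for x in values if abs(x - mu) > limit]
    return inliers, outliers

# Hypothetical number of movies rated per user; one user rated thousands
ratings_per_user = [12, 15, 9, 11, 14, 10, 13, 2000]
inliers, outliers = split_outliers(ratings_per_user, n_sigmas=2.0)
```

Note that the outlier itself inflates both the mean and the standard deviation, so the multiple you pick still needs the common sense mentioned above.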

<br>

**SAGEMAKER GROUND TRUTH**
* **What is Ground Truth?**
  * Sometimes you don't have training data at all, and it needs to be generated by humans first
  * Example: training an image classification model. Somebody needs to tag a bunch of images with what they are images of before training a neural network
  * Ground Truth manages humans who will label your data for training purposes
  * Ground Truth creates its own model as images are labeled by people
  * As this model learns, only images the model isn't sure about are sent to human labelers
  * This can reduce the cost of labeling jobs by up to 70%
  * Who are these human labelers?
    * AWS Mechanical Turk
    * Your own internal team
    * Professional labeling companies
* **Ground Truth Plus**
  * Turnkey solution
  * "Our team of AWS Experts" manages the workflow and team of labelers
    * You fill out an intake form
    * They contact you and discuss pricing
  * You track progress via the Ground Truth Plus Project Portal
  * Get labeled data from S3 when done
* **Other Ways to Generate Training Labels**
  * *Rekognition*
    * AWS service for image recognition
    * Automatically classify images
  * *Comprehend*
    * AWS service for text analysis and topic modeling
    * Automatically classify text by topics, sentiment
  * Any pre-trained model or unsupervised technique that may be helpful            



## 3. DEEP LEARNING

**INTRO TO DEEP LEARNING**
* **Biological Inspiration**
  * Neurons in your cerebral cortex are connected via axons
  * A neuron fires to the neuron it's connected to, when enough of its input signals are activated
  * Very simple at the individual neuron level but layers of neurons connected in this way can yield learning behaviour
  * Billions of neurons, each with thousands of connections, yields a mind
  * Neurons in your cortex seem to be arranged into many stacks or **columns** that process information in parallel
  * Mini-columns of around 100 neurons are organized into larger hyper-columns. There are 100 million mini-columns in your cortex
  * This is coincidentally similar to how GPUs work
* **How do NNs work?**
  * Each neuron just sums up weighted inputs from the layer below, applies some sort of activation function to that sum, and passes the result up to the next layer
  * The network is trained using data with known labels, with one-hot encoding, and during the training process it figures out the ideal weights between each neuron to get the right answer at the top
* **Types of NNs**
  * FeedForward Neural Network    
  * **Convolutional Neural Network** (CNN)
    * Image classification 
  * **Recurrent Neural Network** (RNN)
    * Deals with sequences in time (predict stock prices, NLP, translation, etc)
    * **LSTM** (Long Short Term Memory), **GRU** (Gated Recurrent Unit)
* **NN frameworks**
  * **TensorFlow**: a very popular choice, which is made by Google, and it also incorporates a higher level API called **Keras**
  * **MXNet**: alternative to TensorFlow, it's made by Apache and maybe for that reason Amazon tends to gravitate toward MXNet more than TensorFlow
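
The "sum the weighted inputs, apply an activation function, pass the result up" step can be sketched in plain Python as below. The weights, biases and inputs are arbitrary numbers chosen for illustration, and the bias term is an assumption added for completeness (the notes above only mention weights).

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def dense_layer(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z))
    return outputs

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron
inputs = [0.5, -1.0, 2.0]
hidden_weights = [[0.1, 0.2, 0.3], [-0.4, 0.5, 0.6]]
hidden_biases = [0.0, 0.1]
output_weights = [[1.0, -1.0]]
output_biases = [0.0]

hidden = dense_layer(inputs, hidden_weights, hidden_biases, sigmoid)
output = dense_layer(hidden, output_weights, output_biases, sigmoid)
```

Training is then the process of adjusting those weight values until `output` matches the known labels.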

<br>

**ACTIVATION FUNCTIONS**
* **What they are and types**
  * It's the function inside a given node/neuron that sums up all of the incoming inputs into that neuron and decides what output it should then send out to the next layer of neuron. 
  * **Linear or Identity Function**
    * It doesn't really do anything
    * Can't do backpropagation (no learning)
  * **Binary Step Function**
    * It's on or off
    * Can't handle multiple classification - it's binary after all
    * Vertical slopes don't work well with calculus (infinite derivative, the math blows up)
  * Instead we need **Non-Linear Activation Functions**   
    * These can create complex mappings between inputs and outputs
    * Allow backpropagation (because they have a useful derivative)
    * Allow for multiple layers (linear functions degenerate to a single layer)
  * **Sigmoid Function (Logistic)**
    * Nice & smooth 
    * Scales everything from 0 to 1
    * Changes slowly for high or low values
      * The *Vanishing Gradient Problem*
    * Computationally expensive
  * **Tanh Function**
    * Nice & smooth 
    * Scales everything from -1 to 1
    * Changes slowly for high or low values
      * The *Vanishing Gradient Problem*
    * Computationally expensive
    * Tanh generally preferred over Sigmoid because when you are dealing with ML it's nice to have things with a mean around 0
  * **Rectified Linear Unit (ReLU)**
    * Very popular choice
    * Easy and fast to compute
    * But when inputs are 0 or negative we have a linear function and all of its problems
      * The *Dying ReLU problem*
  * **Leaky ReLU**
    * Solves *Dying ReLU Problem* by introducing a negative slope below 0
  * **Parametric ReLU (PReLU)**
    * Leaky ReLU but the slope in the negative part is learned via backpropagation
    * Complicated and YMMV
  * **Other ReLU variants**
    * **Exponential Linear Unit (ELU)**
    * **Swish**
      * From Google, performs really well
      * But it's not from Amazon...
      * Mostly a benefit with very deep networks (40+ layers)
    * **Maxout**
      * Outputs the max of the inputs
      * Technically ReLU is a special case of maxout
      * But it doubles the parameters that need to be trained, so it's not often practical
  * **Softmax**
    * Used on the final layer of a multiple classification problem
    * Basically converts outputs to probabilities of each classification
    * Can't produce more than one label for something (sigmoid can)          
    * Don't worry about the actual function for the exam, just know what it's used for
* **Choosing an activation function**
  * For multiple classification use **softmax** on the output layer
  * RNN's do well with **tanh**
  * For everything else:
    * Start with **ReLU**
    * If you need to do better try **Leaky ReLU**
    * Last resort: **PReLU**, **Maxout**
    * **Swish** for really deep networks
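
For reference, several of the activation functions above can be written in a few lines of plain Python. This is only a sketch: the `negative_slope` default of 0.01 for Leaky ReLU is an assumed value, and softmax subtracts the maximum logit first purely for numerical stability.

```python
from math import exp, tanh

def sigmoid(z):
    """Scales everything to the range 0 to 1."""
    return 1.0 / (1.0 + exp(-z))

def relu(z):
    """Passes positive inputs through unchanged, outputs 0 otherwise."""
    return max(0.0, z)

def leaky_relu(z, negative_slope=0.01):
    """Like ReLU, but keeps a small slope below 0."""
    return z if z > 0 else negative_slope * z

def softmax(logits):
    """Converts raw outputs into probabilities that sum to 1."""
    highest = max(logits)
    exps = [exp(v - highest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probabilities = softmax([2.0, 1.0, 0.1])
predicted_class = probabilities.index(max(probabilities))
```

`tanh` is imported directly from `math` since it is already a standard function; it scales everything to the range -1 to 1.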

<br>

**CONVOLUTIONAL NEURAL NETWORKS (CNN)**
* **What are CNNs for?**
  * When you look for some pattern or feature in your data but you don't know where exactly it might be
    * Images that you want to find features within
    * Machine translation
    * Sentence Analysis
    * Sentiment Analysis
  * They can find features that aren't in a specific spot
    * Like a stop sign in a picture
    * Words within a sentence
  * They are "feature-location invariant"
* **How do CNNs work?**
  * Inspired by the biology of the visual cortex
    * Local receptive fields are groups of neurons that only respond to a part of what your eyes see (subsampling)
    * They overlap each other to cover the entire visual field (convolutions)
    * They feed into higher layers that identify increasingly complex images
      * Some receptive fields identify horizontal lines, lines at different angles, etc (filters)
      * These would feed into a layer that identifies shapes
      * Which might feed into a layer that identifies objects
    * For color images, extra layers for red, green and blue
  * **CNN** is just taking a source image (or source data of any sort), breaking it up into little chunks called *convolutions*, then assembling those and looking for patterns at increasingly higher complexities in your NN
* **How do we know that's a stop sign?**
  * Individual local receptive fields scan the image looking for edges and pick up the edges of the stop sign in a layer
  * Those edges in turn get picked up by a higher-level convolution that identifies the stop sign's shape (and letters)
  * This shape then gets matched against your pattern of what a stop sign looks like, also using the strong red signal coming from your red layers
  * That information keeps getting processed upward until your foot hits the brake
  * A **CNN** works the same way
* **CNNs with Keras/Tensorflow**
  * Source data must be of appropriate dimensions
    * width x length x color channels
  * Conv2D layer type does the actual convolution on a 2D image 
    * Conv1D and Conv3D also available - doesn't have to be image data
  * MaxPooling2D layers can be used to reduce a 2D layer down by taking the maximum value in a given block
  * Flatten layers will convert the 2D layer to a 1D layer for passing into a flat hidden layer of neurons
  * Typical usage:
    1. Conv2D
    2. MaxPooling2D
    3. Dropout
    4. Flatten
    5. Dense
    6. Dropout
    7. Softmax
* **CNNs are hard**
  * Very resource-intensive (CPU, GPU and RAM)
  * Lots of hyperparameters
    * Kernel size, many layers with different numbers of units, amount of pooling... in addition to the usual stuff like number of layers, choice of optimizer
  * Getting the training data is often the hardest part (as well as storing and accessing it)
* **Specialized CNN Architectures**
  * Defines specific arrangement of layers, padding and hyperparameters
  * **LeNet-5**
    * Good for handwriting recognition
  * **AlexNet**
    * Image classification, deeper than LeNet
  * **GoogLeNet**
    * Even deeper but with better performance
    * Introduces *inception modules* (groups of convolutional layers)
  * **ResNet (Residual Network)**
    * Even deeper - maintains performance via skip connections
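
To make the convolution and max-pooling steps concrete without pulling in Keras, here is a plain-Python sketch of what a Conv2D filter and a MaxPooling2D block compute on a tiny single-channel image. The image and the `vertical_edge` filter are invented for illustration; like most deep learning libraries, the filter is slid over the image without being flipped, with no padding and a stride of 1.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image and sum the element-wise products."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

def max_pool2d(feature_map, size=2):
    """Reduce the map by keeping the maximum of each non-overlapping block."""
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out

# Hypothetical 5x5 image: dark (0) on the left, bright (1) on the right
image = [[0, 0, 0, 1, 1] for _ in range(5)]
vertical_edge = [[-1, 1], [-1, 1]]  # responds where brightness rises left to right

feature_map = conv2d(image, vertical_edge)  # 4x4, non-zero only at the edge
pooled = max_pool2d(feature_map)            # 2x2 summary of the feature map
```

The filter fires only in the column where the edge sits, and pooling keeps that signal while shrinking the map, which is the "feature-location invariant" behaviour described above in miniature.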

<br>

**RECURRENT NEURAL NETWORKS (RNN)**
* **What are RNNs for?**
  * Time-Series data
    * When you want to predict future behaviour based on past behaviour
    * Web logs, sensor logs, stock trades
    * Where to drive your self-driving car based on past trajectories
  * Data that consists of sequences of arbitrary length
    * Machine Translation
    * Image Captions
    * Machine-Generated music
* **RNN Topologies**
  * *Sequence to Sequence*
    * e.g. predict stock prices based on series of historical data
  * *Sequence to Vector*
    * e.g. words in a sentence to sentiment
  * *Vector to Sequence* 
    * e.g. create captions from an image
  * *Encoder -> Decoder*
    * Sequence -> Vector -> Sequence
    * e.g. machine translation
* **Training RNNs**
  * Backpropagation through time
    * Just like backpropagation on MLP's but applied to each time step
  * All those time steps add up fast
    * Ends up looking like a really, really deep NN
    * Can limit backpropagation to a limited number of time steps (truncated backpropagation through time)
  * State from earlier time steps gets diluted over time
    * This can be a problem, for example when learning sentence structures
    * If we're looking at words in a sentence, the words at the beginning of a sentence might be even more important than words toward the end
  * **LSTM Cell**
    * Long Short-Term Memory Cell
    * Maintains separate short-term and long-term states
  * **GRU Cell**
    * Gated Recurrent Unit
    * Simplified LSTM Cell that performs about as well
  * Training is really hard:
    * Very sensitive to topologies, choice of hyperparameters
    * Very resource intensive
    * A wrong choice can lead to an RNN that doesn't converge at all
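
The dilution of early state can be seen even in a toy, single-neuron recurrent cell. The sketch below is a sequence-to-vector setup with scalar, hand-picked weights (`w_input` and `w_state` are hypothetical values, not learned), so it only illustrates the mechanism, not a trained RNN.

```python
from math import tanh

def rnn_sequence_to_vector(sequence, w_input, w_state, bias=0.0):
    """Feed a sequence through one recurrent neuron and return its final state."""
    state = 0.0
    for x in sequence:
        # The new state mixes the current input with the previous state
        state = tanh(w_input * x + w_state * state + bias)
    return state

# Two short sequences that differ in their only element
short_a = rnn_sequence_to_vector([1.0], w_input=1.0, w_state=0.5)
short_b = rnn_sequence_to_vector([-1.0], w_input=1.0, w_state=0.5)

# Two long sequences that differ only in their FIRST element
long_a = rnn_sequence_to_vector([1.0] + [0.0] * 20, w_input=1.0, w_state=0.5)
long_b = rnn_sequence_to_vector([-1.0] + [0.0] * 20, w_input=1.0, w_state=0.5)
```

After 20 further time steps the two long sequences end in almost the same state, so the first element has nearly been forgotten. This is the problem LSTM and GRU cells are designed to reduce.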

<br>

**MODERN NLP WITH BERT AND GPT AND TRANSFER LEARNING**
* **Modern NLP (Natural Language Processing)**
  * **Transformer** deep learning architectures are what's hot
    * Adopts mechanism of **self-attention**
      * Weights significance of each part of the input data
      * Processes sequential data (like words, as an RNN does) but processes the entire input all at once
      * The attention mechanism provides context, so no need to process one word at a time
    * **BERT**, RoBERTa, T5, **GPT-2**, DistilBERT
    * DistilBERT uses knowledge distillation to reduce model size by 40%
  * **BERT**: Bidirectional Encoder Representations from Transformers
  * **GPT**: Generative Pre-trained Transformer
* **Transfer Learning**
  * NLP models (and others) are too big and complex to build from scratch and re-train every time
    * The latest may have hundreds of billions of parameters
  * Model zoos such as **Hugging Face** offer pre-trained models to start from
    * Integrated with SageMaker via Hugging Face Deep Learning Containers
  * You can fine-tune these models for your own use cases
* **Transfer Learning: BERT example**
  * Hugging Face offers a Deep Learning Container (DLC) for BERT
  * It's pre-trained on BookCorpus and Wikipedia
  * You can fine-tune BERT (or DistilBERT etc) with your own additional training data through transfer learning
    * Tokenize your own training data to be of the same format
    * Just start training it further with your data, with a low learning rate
* **Transfer Learning Approaches**
  * Continue training a pre-trained model (**fine-tuning**)
    * Use for fine-tuning a model that has way more training data than you'll ever have
    * Use a low learning rate to ensure you are just incrementally improving the model
  * Add new trainable layers to the top of a frozen model
    * Learns to turn old features into predictions on new data
    * Can do both: add new layers then fine tune as well
  * Retrain from scratch 
    * If you have a large amount of training data and it's fundamentally different from what the model was pre-trained with
    * And you have the computing capacity for it!
  * Use it as-is
    * When the model's training data is what you want already

<br>

**DEEP LEARNING ON EC2 AND EMR**
* EMR supports Apache MXNet and GPU instance types
* Appropriate instance types for deep learning:
  * P3: up to 8 Tesla V100 GPUs
  * P2: up to 16 K80 GPUs
  * G3: up to 4 M60 GPUs (all NVIDIA chips)
* Deep Learning AMIs

<br>

**TUNING NEURAL NETWORKS**
* **Learning Rate**
  * Neural networks are trained by gradient descent (or similar means)
  * We start at some random point and sample different solutions (weights) seeking to minimize some cost function over many *epochs*
  * How far apart these samples are is the **learning rate**
* **Effect of Learning Rate**
  * Too high a learning rate means you might overshoot the optimal solution
  * Too small a learning rate will take too long to find the optimal solution
  * Learning Rate is an example of a *hyperparameter*
* **Batch Size**
  * How many training samples are used within each batch of each epoch
  * Somewhat counter-intuitively:
    * Smaller batch sizes can work their way out of *local minima* more easily
    * Batch sizes that are too large can end up getting stuck in the wrong solution
    * Random shuffling at each epoch can make this look like very inconsistent results from run to run
* **To Recap (! important for exam !)**
  * *Small Batch Sizes* tend to not get stuck in local minima
  * *Large Batch Sizes* can converge on the wrong solution at random
  * *Large Learning Rates* can overshoot the correct solution
  * *Small Learning Rates* increase training time
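
The effect of the learning rate is easy to see on a toy cost function. The sketch below runs plain gradient descent on f(w) = w², whose minimum is at w = 0; the function, the starting point and the three learning rates are all assumptions picked to make the behaviour obvious.

```python
def gradient_descent(learning_rate, start=10.0, epochs=50):
    """Minimize f(w) = w**2 (gradient 2*w); return the weight after each epoch."""
    w = start
    history = [w]
    for _ in range(epochs):
        gradient = 2 * w
        w = w - learning_rate * gradient
        history.append(w)
    return history

too_small = gradient_descent(learning_rate=0.001)  # barely moves in 50 epochs
reasonable = gradient_descent(learning_rate=0.1)   # ends very close to 0
too_large = gradient_descent(learning_rate=1.1)    # overshoots further each step
```

With the small rate the weight is still near its starting value after 50 epochs, while the large rate jumps past the minimum on every step and ends up further away than where it started.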

<br>

**NEURAL NETWORKS REGULARIZATION TECHNIQUES**
* **What is Regularization?**
  * Preventing overfitting
    * Overfitted models are good at making predictions on the data they were trained on but not on new data they haven't seen before
    * Overfitted models have learned patterns in the training data that don't generalize to the real world
    * Often seen as high accuracy on training data set but lower accuracy on test or evaluation data set
      * When training and evaluating a model we use *training*, *evaluation* and *testing* data sets
  * Regularization techniques are intended to prevent overfitting
* **Techniques to prevent overfitting**
  * *Avoid an overly complex structure*:
    * Too many layers
    * Too many neurons
  * **Dropout**
    * Remove some neurons at random at each training step; this forces your model to spread its learning out, which has a regularization effect that prevents overfitting
  * **Early Stopping**
    * Detects when the validation accuracy has leveled out while the training accuracy is still increasing, and stops training at that point
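
A minimal sketch of the early stopping idea, assuming you record validation accuracy after every epoch. The `patience` and `min_delta` parameters mirror what many frameworks expose, but this helper and the accuracy values are hypothetical.

```python
def early_stopping_epoch(val_accuracy, patience=2, min_delta=0.0):
    """Return the epoch index at which training would stop.

    Training stops once validation accuracy has failed to improve by more
    than min_delta for `patience` consecutive epochs. If that never happens,
    the index of the last epoch is returned.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, accuracy in enumerate(val_accuracy):
        if accuracy > best + min_delta:
            best = accuracy
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_accuracy) - 1

# Hypothetical validation accuracy per epoch: it levels out around epoch 3
val_acc = [0.70, 0.78, 0.83, 0.85, 0.85, 0.84, 0.84, 0.83]
stop_at = early_stopping_epoch(val_acc, patience=2)
```

Here training would halt at epoch index 5, two epochs after the best validation accuracy was reached.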

<br>

**L1 and L2 REGULARIZATION**
* **What are L1 & L2?**
  * Prevent overfitting in ML in general
  * A regularization term is added as weights are learned
  * **L1** term is the sum of the absolute values of the weights
  * **L2** term is the sum of the square of the weights
  * Same idea can be applied to loss function
* **What's the difference?**
  * L1: sum of absolute weights multiplied by λ
    * Performs *feature selection* - entire features go to 0
    * Computationally inefficient
    * Sparse output
  * L2: sum of square of weights multiplied by λ
    * All features remain considered, just weighted
    * Computationally efficient
    * Dense output
* **Why would you want L1?**
  * Feature selection can reduce dimensionality
    * Out of 100 features, maybe only 10 end up with non-zero coefficients
    * The resulting sparsity can make up for its computational inefficiency
  * But if you think all of your features are important, L2 is probably a better choice
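
The two regularization terms can be computed directly from the definitions above. In this sketch the weights, the base loss and λ (`lam`) are made-up values; in practice the term is added to the loss during training rather than computed by hand.

```python
def l1_penalty(weights, lam):
    """L1 term: lambda times the sum of the absolute values of the weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 term: lambda times the sum of the squared weights."""
    return lam * sum(w ** 2 for w in weights)

# Hypothetical model weights, unregularized loss and regularization strength
weights = [0.5, -2.0, 0.0, 3.0]
base_loss = 1.0
lam = 0.1

loss_with_l1 = base_loss + l1_penalty(weights, lam)  # 1.0 + 0.1 * 5.5
loss_with_l2 = base_loss + l2_penalty(weights, lam)  # 1.0 + 0.1 * 13.25
```

Because L2 squares each weight, the large weight (3.0) dominates its penalty, whereas L1 charges every unit of weight equally, which is what tends to push small weights all the way to 0.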

<br>

**GRIEF WITH GRADIENTS**
* **The Vanishing Gradient Problem**
  * When the slope of the learning curve approaches zero, things can get stuck
  * We end up working with very small numbers that slow down training or even introduce numerical errors
  * Becomes a problem with deeper networks and RNNs as these "vanishing gradients" propagate to deeper layers
  * Opposite problem: **Exploding Gradients**
* **Fixing the Vanishing Gradient Problem**
  * Multi-level hierarchy
    * Break up levels into their own sub-networks trained individually
  * Long Short-Term Memory (LSTM)
  * Residual Networks
    * ResNet
    * Ensemble of shorter networks
  * Better choice of the activation function
    * ReLU is a good choice
* **Gradient Checking**
  * A debugging technique
  * Numerically check the derivatives computed during training
  * Useful for validating code of neural network training 
    * But you're probably not going to be writing this code in the ML industry
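
Gradient checking boils down to comparing a hand-derived derivative with a numerical estimate. The sketch below does this for a made-up one-parameter loss; the `loss` function, the step size `eps` and the tolerance you would compare against are all assumptions.

```python
def numerical_gradient(f, w, eps=1e-5):
    """Central-difference estimate of the derivative of f at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

def loss(w):
    """Hypothetical loss: squared error of a one-weight model."""
    return (3.0 * w - 6.0) ** 2

def analytic_gradient(w):
    """Derivative of loss(w) derived by hand: 2 * (3w - 6) * 3."""
    return 2 * (3.0 * w - 6.0) * 3.0

w = 0.5
numeric = numerical_gradient(loss, w)
analytic = analytic_gradient(w)
relative_error = abs(numeric - analytic) / max(abs(numeric), abs(analytic))
```

A tiny relative error suggests the derivative code is right; a large one points to a bug in the backpropagation implementation.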

<br>

**CONFUSION MATRIX**
* **Sometimes Accuracy Doesn't Tell The Whole Story**
  * A test for a rare disease can be 99.9% accurate by just guessing "no" all the time
  * We need to understand true positives and true negatives as well as false positives and false negatives
  * A confusion matrix shows this
  * Check the multi-class confusion matrix example from the AWS docs

<br>

**MEASURING YOUR MODELS**
* **Recall**
  * Formula: TP/(TP+FN)
  * AKA Sensitivity, True Positive Rate, Completeness
  * % of positives rightly predicted
  * Good choice of metric when you care a lot about false negatives
    * e.g. fraud detection
* **Precision**
  * Formula: TP/(TP+FP)
  * AKA Correct Positives
  * % of relevant results
  * Good choice of metric when you care a lot about false positives
    * e.g. medical screening, drug testing
* **Specificity**
  * Formula: TN/(TN+FP)
  * True negative rate
* **F1-Score**
  * Formula: 2TP/(2TP+FP+FN)
  * Formula: 2((Precision*Recall)/(Precision+Recall))
  * Harmonic mean of precision and recall
  * When you care about precision and recall  
* **RMSE**
  * Root Mean Squared Error
  * Accuracy measurement for numeric (regression) predictions
  * Only cares about how far off the answers are, not about false positives vs false negatives
* **ROC Curve**
  * Receiver Operating Characteristic Curve
  * Plot of true positive rate (recall) vs false positive rate at various threshold settings
  * Points above the diagonal (y=x) represent good classification (better than random)
  * Ideal curve would just be a point in the upper-left corner
  * The more it's bent toward the upper left the better
* **AUC**
  * The area under the ROC curve
  * Area Under the Curve (AUC)
  * Equal to probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one
  * ROC AUC of 0.5 is a useless classifier, 1.0 is perfect
  * Commonly used metric for comparing classifiers
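
The formulas in this section translate directly into code. The sketch below computes recall, precision, specificity and F1 from confusion-matrix counts, and AUC from its probabilistic definition (ties counted as half a win, which is an assumption about tie handling). All counts, labels and scores are invented for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics above from confusion-matrix counts."""
    return {
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

def roc_auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical confusion-matrix counts
metrics = classification_metrics(tp=8, fp=2, tn=85, fn=5)

# Hypothetical labels and classifier scores
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

Both F1 formulas above give the same number: `metrics["f1"]` equals the harmonic mean of `metrics["precision"]` and `metrics["recall"]`.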

<br>

**ENSEMBLE LEARNING**
* **Ensemble Methods**
  * Common example: **Random Forest**
    * Decision Trees are prone to overfitting
    * So, make lots of decision trees and let them all vote on the result
    * This is a random forest
    * But how do these trees differ? Using the following concepts...
* **Bagging**
  * Generate N new training sets by random sampling with replacement
  * Each resampled model can be trained in parallel
* **Boosting**
  * Observations are weighted
  * Some will take part in new training sets more often
  * Training is sequential, each classifier takes into account the previous one's success
  * How it works: start with equal weights on each observation, at each stage we reweight the data and the model, and we run again. Iterating on that we'll get better and better results
* **Bagging vs Boosting**
  * **XGBoost** is the latest hotness
  * Boosting generally yields better accuracy
  * But bagging avoids overfitting
  * Bagging is easier to parallelize
  * So, depends on your goal          
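
Bagging's "random sampling with replacement" and the voting step of a random forest can be sketched with the standard library alone. The dataset, the seed and the per-tree predictions below are placeholders; no actual decision trees are trained here.

```python
import random

def bootstrap_samples(dataset, n_sets, seed=0):
    """Generate n_sets training sets by sampling with replacement."""
    rng = random.Random(seed)
    return [[rng.choice(dataset) for _ in dataset] for _ in range(n_sets)]

def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return max(set(predictions), key=predictions.count)

# Hypothetical dataset of 10 observations (just their row ids)
dataset = list(range(10))
training_sets = bootstrap_samples(dataset, n_sets=3)

# Hypothetical predictions from three independently trained trees
vote = majority_vote(["fraud", "ok", "fraud"])
```

Each resampled set has the same size as the original but usually repeats some rows and omits others, so every model sees a slightly different view of the data and can be trained in parallel.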




## 4. AMAZON SAGEMAKER

**INTRODUCING AMAZON SAGEMAKER**
* **What is Amazon SageMaker?**
  * **SageMaker** is built to handle the entire ML workflow
    * Fetch, clean and prepare data
    * Train and evaluate a model
    * Deploy model and evaluate results in production
  * **SageMaker** Notebooks can direct the process
    * Notebook instances on EC2 are spun up from the console
    * S3 data access
    * scikit-learn, Spark, TensorFlow
    * Wide variety of built-in models
    * Ability to spin up training instances
    * Ability to deploy trained models for making predictions at scale
    * You can use the SageMaker console instead of notebooks but here you can't write code
* **Data Prep on SageMaker**
  * Data usually comes from S3
    * Ideal format varies with algorithm - often it is RecordIO/Protobuf
  * Can also ingest from Athena, EMR, Redshift and Amazon Keyspaces DB
  * Apache Spark integrates with SageMaker
  * scikit-learn, numpy, pandas all at your disposal within a notebook
* **Training on SageMaker**
  * Create a training job
    * URL of S3 bucket with training data
    * ML compute resources
    * URL of S3 bucket for output
    * ECR path to training code
  * Training options
    * Built-in training algorithms
    * Spark MLLib
    * Custom Python TensorFlow/MXNet code
    * Your own Docker image
    * Algorithm purchased from AWS marketplace
* **Deploying Trained Model**
  * Save your trained model to S3
  * Can deploy 2 ways:
    * Persistent endpoint for making individual predictions on demand
    * SageMaker Batch Transform to get predictions for an entire dataset
  * Lots of cool options
    * Inference Pipelines for more complex processing
    * SageMaker Neo for deploying to edge devices 
    * Elastic Inference for accelerating deep learning models
    * Automatic scaling (increase # of instances behind an endpoint as needed)

<br>

**--- SAGEMAKER'S BUILT-IN ALGORITHMS ---**

**LINEAR LEARNER**
* **What is it for?**
  * Linear Regression
    * Fit a line to your training data
    * Predictions based on that line
  * Can handle both regression (numeric) predictions and classification predictions
    * For classification, a linear threshold function is used
    * Can do binary or multi-class
* **What training input does it expect?**
  * RecordIO-wrapped protobuf 
    * Float32 data only!
  * CSV
    * First column assumed to be the label
  * File or Pipe mode both supported
* **How is it used?**
  * Preprocessing 
    * Training data must be *normalized* (so all features are weighted the same)
    * Linear Learner can do this for you automatically
    * Input data should be *shuffled*
  * Training
    * Uses Stochastic Gradient Descent  
    * Choose an optimization algorithm (Adam, AdaGrad, SGD, etc)
    * Multiple models are optimized in parallel 
    * Tune L1, L2 regularization
  * Validation
    * Most optimal model is selected
* **Important Hyperparameters**
  * *Balance_multiclass_weights*
    * Gives each class equal importance in loss functions
  * *Learning_rate*, *mini_batch_size*
  * *L1*
    * Regularization
  * *Wd*
    * Weight decay (L2 regularization)
* **Instance Types**
  * Training
    * Single or multi-machine CPU or GPU
    * Multi-GPU does not help

<br>

**XGBOOST**
* **What is it for?**
  * eXtreme Gradient Boosting 
    * Boosted group of decision trees
    * New trees made to correct the errors of previous trees
    * Uses gradient descent to minimize loss as new trees are added
  * It's been winning a lot of Kaggle competitions
    * And it's fast too
  * Can be used for classification
  * And also for regression
    * Using regression trees
* **What training input does it expect?**
  * XGBoost is weird, since it's not made for SageMaker. It's just open source XGBoost
  * So, it takes CSV or libsvm input  
  * AWS recently extended it to accept recordIO-protobuf and Parquet as well
* **How is it used?**
  * Models are serialized/deserialized with Pickle
  * Can use as a framework within notebooks
    * sagemaker.xgboost
  * Or as a built-in SageMaker algorithm
* **Important Hyperparameters**
  * There are a lot of them
  * *Subsample*
    * Prevents overfitting
  * *Eta*
    * Step size shrinkage, prevents overfitting
  * *Gamma*
    * Minimum loss reduction to create a partition, larger = more conservative
  * *Alpha*
    * L1 regularization term, larger = more conservative
  * *Lambda*
    * L2 regularization term, larger = more conservative
  * *eval_metric*
    * Optimize on AUC, error, RMSE
    * For example, if you care about false positives more than accuracy, you might use AUC here
  * *scale_pos_weight*
    * Adjusts balance of positive and negative weights
    * Helpful for unbalanced classes
    * Might set to sum(negative cases) / sum(positive cases)
  * *max_depth*
    * Max depth of the tree
    * Too high and you may overfit     
* **Instance Types**
  * Uses CPUs only for multiple instance training
  * Is memory-bound, not compute-bound
  * So, **M5** is a good choice
  * As of XGBoost 1.2, single instance GPU training is available
    * For example **P3**
    * Must set *tree_method* hyperparameter to *gpu_hist*
    * Trains more quickly and can be more cost effective
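
As a small worked example of the `scale_pos_weight` rule of thumb above, the sketch below derives it from class counts and places it alongside the other hyperparameters named in this section. The label list and every other value in the dictionary are arbitrary placeholders, not tuned recommendations, and the dictionary is plain Python rather than a call into any SDK.

```python
# Hypothetical unbalanced labels: 95 negative cases, 5 positive cases
labels = [0] * 95 + [1] * 5

negatives = labels.count(0)
positives = labels.count(1)

# Hyperparameter names from the notes above; values are illustrative only
hyperparameters = {
    "subsample": 0.8,
    "eta": 0.2,
    "gamma": 4,
    "alpha": 0.0,
    "lambda": 1.0,
    "eval_metric": "auc",
    "max_depth": 5,
    # sum(negative cases) / sum(positive cases)
    "scale_pos_weight": negatives / positives,
}
```

With 95 negatives and 5 positives the ratio is 19, so each positive example counts roughly as much as 19 negative ones during training.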

<br>

**SEQ2SEQ**
* **What is it for?**
  * Sequence to Sequence
  * Input is a sequence of tokens, output is a sequence of tokens
  * Machine Translation
  * Text Summarization
  * Speech to Text
  * Implemented with RNNs and CNNs with attention
* **What training input does it expect?**
  * RecordIO-Protobuf
    * Tokens must be integers (this is unusual, since most algorithms want floating point data)
  * Start with tokenized text files
  * Convert to protobuf using sample code
    * Packs into integer tensors with vocabulary files
    * A lot like the TF/IDF lab we did earlier
  * Must provide training data, validation data and vocabulary files
* **How is it used?**
  * Training for machine translation can take days even on SageMaker
  * Pre-trained models are available
    * See the example notebook
  * Public training datasets are available for specific translation tasks
* **Important Hyperparameters**
  * *Batch_size*
  * *Optimizer_type* (adam, sgd, rmsprop)
  * *Learning_rate*
  * *Num_layers_encoder*
  * *Num_layers_decoder*
  * Can optimize on:
    * Accuracy
      * vs provided validation dataset
    * BLEU score
      * Compares against multiple reference translations
    * Perplexity
      * Cross-entropy
* **Instance Types**
  * Can only use GPU instance types (**P3** for example)
  * Can only use a single machine for training 
    * But can use multi-GPUs on one machine
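
The link between perplexity and cross-entropy mentioned above can be made concrete. A minimal sketch, assuming the model reports a probability for each reference token (the probabilities below are made up):

```python
import math

def cross_entropy(token_probs):
    """Average negative log-likelihood (in nats) of the reference tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def perplexity(token_probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(token_probs))

# Hypothetical probabilities the model assigned to each correct output token.
probs = [0.5, 0.25, 0.5, 0.25]
print(cross_entropy(probs), perplexity(probs))
```

A model that always spreads its probability evenly over 4 tokens has a perplexity of 4, so lower is better.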

<br>

**DEEPAR**
* **What is it for?**
  * **Forecasting** one-dimensional time series data
  * Uses RNNs
  * Allows you to train the same model over several related time series
  * Finds frequencies and seasonality
* **What training input does it expect?**
  * JSON lines format
    * Gzip or Parquet
  * Each record must contain:
    * Start: the starting time stamp
    * Target: the time series values
  * Each record can contain:
    * *Dynamic_feat*: dynamic features (such as whether a promotion was applied to a product in a time series of product purchases)
    * *Cat*: categorical features   
* **How is it used?**
  * Always include entire time series for training, testing and inference
  * Use the entire dataset as the test set and remove the last time points for training; evaluate on the withheld values
  * Don't use very large values for prediction length (> 400)
  * Train on many time series and not just one when possible
* **Important Hyperparameters**
  * *Context_length*
    * Number of time points the model sees before making a prediction
    * Can be smaller than the seasonality period; the model looks at lagged values of up to one year anyhow
  * *Epochs*
  * *mini_batch_size*
  * *Learning_rate*
  * *Num_cells*
* **Instance Types**
  * Can use CPU or GPU
  * Single or multi machine
  * Start with CPU (C4.2xlarge, C4.4xlarge)
  * Move up to GPU if necessary
    * Only helps with larger models
  * CPU only for inference
  * May need larger instances for tuning
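
The record layout and the "withhold the last time points" advice above might look like this in practice. A minimal sketch: the lowercase field names (`start`, `target`, `cat`, `dynamic_feat`) are the JSON keys as commonly documented for DeepAR, while the helper names and sample values are hypothetical.

```python
import json

def make_record(start, target, cat=None, dynamic_feat=None):
    """Build one JSON Lines record: start and target required, the rest optional."""
    record = {"start": start, "target": list(target)}
    if cat is not None:
        record["cat"] = list(cat)
    if dynamic_feat is not None:
        record["dynamic_feat"] = [list(f) for f in dynamic_feat]
    return record

def split_train_test(record, prediction_length):
    """Test keeps the full series; train drops the last prediction_length points."""
    if prediction_length <= 0:
        raise ValueError("prediction_length must be positive")
    train = dict(record)
    train["target"] = record["target"][:-prediction_length]
    if "dynamic_feat" in record:
        train["dynamic_feat"] = [f[:-prediction_length] for f in record["dynamic_feat"]]
    return train, record

full = make_record("2024-01-01 00:00:00", [5, 7, 6, 9, 12, 11],
                   cat=[0], dynamic_feat=[[0, 0, 1, 0, 1, 0]])
train, test = split_train_test(full, prediction_length=2)
print(json.dumps(train))  # one line of the training file
print(json.dumps(test))   # one line of the test file
```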

<br>

**BLAZINGTEXT**
* **What is it for?**
  * Text classification
    * Predict labels for a sentence
    * Useful in web searches, information retrieval
    * Supervised
  * Word2vec
    * Word embedding layer
    * Creates a vector representation of words
    * Semantically similar words are represented by vectors close to each other
    * This is called a *word embedding*
    * It's useful for NLP but is not an NLP algorithm in itself
      * Used in machine translation, sentiment analysis
    * Remember it only works on individual words, not sentences or documents
* **What training input does it expect?**
  * For supervised mode (text classification):
    * One sentence per line
    * First "word" in the sentence is the string \__label__ followed by the label
    * Also, "augmented manifest text format"
  * For Word2vec mode: just a text file with one training sentence per line
* **How is it used?**
  * Word2vec has multiple modes
    * Cbow (Continuous Bag of Words)
    * Skip-gram
    * Batch skip-gram
      * Distributed computation over many CPU nodes 
* **Important Hyperparameters**
  * Word2vec:
    * *Mode (batch_skipgram, skipgram, cbow)*
    * *Learning_rate*
    * *Window_size*
    * *Vector_dim*
    * *Negative_samples*
  * Text Classification
    * *Epochs*
    * *Learning_rate*
    * *Word_ngrams*
    * *Vector_dim*  
* **Instance Types**
  * For cbow and skipgram, recommend a single *ml.p3.2xlarge*
    * Any single CPU or single GPU instance will work
  * For *batch_skipgram* can use a single or multiple CPU instances
  * For text classification, *C5* recommended if less than 2GB of data. For larger data sets use a single GPU instance (*ml.p2.xlarge or ml.p3.2xlarge*) 
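
The supervised input format described above (one sentence per line, prefixed with `__label__` plus the label) can be produced with a small helper. A minimal sketch; lowercasing and whitespace tokenization are assumptions made here for illustration, and the sample reviews are invented.

```python
def to_blazingtext_line(label, sentence):
    """Format one training example as '__label__<label> token token ...'."""
    tokens = sentence.lower().split()
    return "__label__" + label + " " + " ".join(tokens)

# Hypothetical labeled sentences.
examples = [
    ("positive", "Great  product, works well"),
    ("negative", "Stopped working after a week"),
]
lines = [to_blazingtext_line(label, text) for label, text in examples]
print("\n".join(lines))
```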

<br>

**OBJECT2VEC**
* **What is it for?**
  * Remember word2vec from BlazingText? It's like that but arbitrary objects
  * It creates low-dimensional dense embeddings of high-dimensional objects
  * It's basically word2vec generalized to handle things other than words
  * Compute nearest neighbors of objects
  * Visualize clusters
  * Genre prediction
  * Recommendations (similar items or users)
* **What training input does it expect?**
  * Data must be tokenized into integers
  * Training data consists of pairs of tokens and/or sequences of tokens
    * Sentence - sentence
    * Labels - sequence (genre to description?)
    * Customer - customer
    * Product - product
    * User - item
* **How is it used?**
  * Process data into JSON Lines and shuffle it
  * Train with two input channels, two encoders and a comparator
  * Encoder choices:
    * Average-pooled embeddings
    * CNNs
    * Bidirectional LSTM
  * Comparator is followed by a feed-forward neural network  
* **Important Hyperparameters**
  * The usual deep learning ones
    * *Dropout*, *early_stopping*, *epochs*, *learning_rate*, *batch_size*, *layers*, *activation function*, *optimizer*, *weight decay*
  * Enc1_network, enc2_network
    * Choose hcnn, bilstm, pooled_embedding  
* **Instance Types**
  * Can only train on a single machine (CPU or GPU, multi-GPU OK)
    * ml.m5.2xlarge
    * ml.p2.xlarge
    * If needed go up to ml.m5.4xlarge or ml.m5.12xlarge
  * Inference: use ml.p3.2xlarge
    * Use INFERENCE_PREFERRED_MODE environment variable to optimize for encoder embeddings rather than classification or regression
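
The "tokenize into integers, pair up, write JSON Lines, shuffle" steps above can be sketched as follows. The keys `label`, `in0` and `in1` are an assumption about the expected record layout, and the vocabulary builder and sentence pairs are hypothetical.

```python
import json
import random

def build_vocab(sentences):
    """Assign each distinct lowercase token an integer id (toy tokenizer)."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(sentence, vocab):
    return [vocab[token] for token in sentence.lower().split()]

def to_json_lines(pairs, vocab, seed=0):
    """Turn (left, right, label) pairs into shuffled JSON Lines strings."""
    rows = [{"label": label, "in0": encode(a, vocab), "in1": encode(b, vocab)}
            for a, b, label in pairs]
    random.Random(seed).shuffle(rows)  # shuffle before training, as noted above
    return [json.dumps(row) for row in rows]

pairs = [
    ("a dog runs", "an animal moves", 1),
    ("a dog runs", "the stock fell", 0),
    ("rain is falling", "the weather is wet", 1),
]
vocab = build_vocab([s for a, b, _ in pairs for s in (a, b)])
lines = to_json_lines(pairs, vocab)
print("\n".join(lines))
```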

<br>

**OBJECT DETECTION**
* **What is it for?**
  * Identify all objects in an image with bounding boxes
  * Detects and classifies objects with a single deep neural network
  * Classes are accompanied by confidence scores
  * Can train from scratch or use pre-trained models based on ImageNet
* **What training input does it expect?**
  * RecordIO or image format (jpg or png)
  * With image format, supply a JSON file for annotation data for each image
* **How is it used?**
  * Takes an image as input, outputs all instances of objects in the image with categories and confidence scores
  * Uses a CNN with the Single Shot multibox Detector (SSD) algorithm
    * The base CNN can be VGG-16 or ResNet-50
  * Transfer Learning mode / incremental training
    * Use a pre-trained model for the base network weights, instead of random initial weights
  * Use flips, rescale and jitter internally to avoid overfitting    
* **Important Hyperparameters**
  * *Mini_batch_size*
  * *Learning_rate*
  * *Optimizer*
    * *sgd*, *adam*, *rmsprop*, *adadelta*
* **Instance Types**
  * Use GPU instances for training (multi GPU and multi-machine OK)
    * ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge
  * Use CPU or GPU for inference
    * C5, M5, P2, P3 all OK

<br>

**IMAGE CLASSIFICATION**
* **What is it for?**
  * Assign one or more labels to an image
  * Doesn't tell you where objects are, just what objects are in the image
* **What training input does it expect?**
  * Apache MXNet RecordIO
    * Not Protobuf
    * This is for interoperability with other deep learning frameworks
  * Or raw jpg or png images
  * Image format requires .lst files to associate image index, class label and path to the image
  * Augmented manifest image format enables Pipe mode  
* **How is it used?**
  * ResNet CNN under the hood
  * Full training mode
    * Network initialized with random weights
  * Transfer learning mode
    * Initialized with pre-trained weights
    * The top fully-connected layer is initialized with random weights
    * Network is fine-tuned with new training data
  * Default image size is 3-channel 224x224 (ImageNet's dataset)    
* **Important Hyperparameters**
  * The usual suspects for deep learning
    * *Batch size*, *learning rate*, *optimizer*
  * Optimizer-specific parameters
    * Weight decay, beta 1, beta 2, eps, gamma  
* **Instance Types**
  * GPU instances for training (P2, P3)
  * Multi-GPU and multi-machine is OK
  * CPU or GPU for inference (C4, P2, P3)

<br>

**SEMANTIC SEGMENTATION**
* **What is it for?**
  * Pixel-level object classification
  * Different from image classification - that assigns labels to whole images
  * Different from object detection - that assigns labels to bounding boxes 
  * Useful for self-driving vehicles, medical imaging diagnostics, robot sensing
  * Produces a *segmentation mask* 
* **What training input does it expect?**
  * JPG images and PNG annotations
  * For both training and validation
  * Label maps to describe annotations
  * Augmented manifest image format supported for Pipe mode
  * JPG images accepted for inference
* **How is it used?**
  * Built on MXNet Gluon and Gluon CV
  * Choice of 3 algorithms:
    * Fully-Convolutional Network (FCN)
    * Pyramid Scene Parsing (PSP)
    * DeepLabV3
  * Choice of backbones:
    * ResNet50
    * ResNet101
    * Both trained on ImageNet
  * Incremental training and training from scratch are both supported
* **Important Hyperparameters**
  * *Epochs*, *learning_rate*, *batch_size*, *optimizer*
  * Algorithm
  * Backbone  
* **Instance Types**
  * Only GPU supported for training (P2 or P3) on a single machine only
    * Specifically ml.p2.xlarge, ml.p2.8xlarge, ml.p2.16xlarge, ml.p3.2xlarge, ml.p3.8xlarge, ml.p3.16xlarge 
  * Inference on CPU (C5 or M5) or GPU (P2 or P3)

<br>

**RANDOM CUT FOREST**
* **What is it for?**
  * Anomaly Detection
  * Unsupervised
  * Detect unexpected spikes in time series data
  * Breaks in periodicity
  * Unclassifiable data points
  * Assigns an anomaly score to each data point
  * Based on an algorithm developed by Amazon that they seem to be very proud of!
* **What training input does it expect?**
  * RecordIO-protobuf or CSV
  * Can use File or Pipe mode on either
  * Optional test channel for computing accuracy, precision, recall and F1 on labeled data (anomaly or not)
* **How is it used?**
  * Creates a forest of trees where each tree is a partition of the training data; looks at expected change in complexity of the tree as a result of adding a point into it
  * Data is sampled randomly
  * Then trained
  * RCF shows up in Kinesis Analytics as well; it can work on streaming data too
* **Important Hyperparameters**
  * *Num_trees*
    * Increasing reduces noise
  * *Num_samples_per_tree*
    * Should be chosen such that 1/*num_samples_per_tree* approximates the ratio of anomalous to normal data
* **Instance Types**
  * Does not take advantage of GPUs
  * Use M4, C4 or C5 for training
  * ml.c5.xlarge for inference
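
The *num_samples_per_tree* guidance above is a one-line calculation. A minimal sketch with a hypothetical helper name and a made-up anomaly rate:

```python
def suggest_num_samples_per_tree(anomaly_ratio):
    """Pick a value so that 1/num_samples_per_tree is close to anomaly_ratio."""
    if not 0 < anomaly_ratio <= 1:
        raise ValueError("anomaly_ratio must be in (0, 1]")
    return max(1, round(1 / anomaly_ratio))

# If roughly 0.4% of points are expected to be anomalous:
print(suggest_num_samples_per_tree(0.004))
```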

<br>

**NEURAL TOPIC MODEL**
* **What is it for?**
  * Organize documents into topics
  * Classify or summarize documents based on topics
  * It's not just TF/IDF
    * "bike", "car", "train", "mileage" and "speed" might classify a document as "transportation" for example (although it wouldn't know to call it that)
  * Unsupervised
    * Algorithm is "Neural Variational Inference"  
* **What training input does it expect?**
  * Four data channels:
    * "train" is required
    * "validation", "test" and "auxiliary" optional
  * recordIO-protobuf or CSV
  * Words must be tokenized into integers
    * Every document must contain a count for every word in the vocabulary in CSV
    * The "auxiliary" channel is for the vocabulary
  * File or Pipe mode (Pipe always faster)    
* **How is it used?**
  * You define how many topics you want
  * These topics are a latent representation based on top ranking words
  * One of two topic modeling algorithms in SageMaker - you can try them both!
* **Important Hyperparameters**
  * Lowering *mini_batch_size* and *learning_rate* can reduce validation loss
    * At expense of training time
  * Num_topics  
* **Instance Types**
  * GPU or CPU
    * GPU recommended for training
    * CPU OK for inference
    * CPU is cheaper
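
The CSV requirement above (every document carries a count for every word in the vocabulary) amounts to a bag-of-words matrix. A minimal sketch with a toy whitespace tokenizer and invented documents; the same layout applies to the LDA notes below.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of distinct lowercase tokens; the position is the word's integer id."""
    return sorted({word for doc in docs for word in doc.lower().split()})

def to_count_row(doc, vocabulary):
    """One count per vocabulary word, including zeros for absent words."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

docs = ["bike car speed", "car car mileage", "train speed speed"]
vocabulary = build_vocabulary(docs)
rows = [to_count_row(doc, vocabulary) for doc in docs]
csv_lines = [",".join(str(count) for count in row) for row in rows]
print(vocabulary)
print("\n".join(csv_lines))
```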

<br>

**LATENT DIRICHLET ALLOCATION (LDA)**
* **What is it for?**
  * Another topic modeling algorithm
    * Not Deep Learning
  * Unsupervised
    * The topics themselves are unlabeled; they are just groupings of documents with a shared subset of words
  * Can be used for things other than words
    * Cluster customers based on purchases
    * Harmonic analysis in music    
* **What training input does it expect?**
  * Train channel, optional test channel
  * recordIO-protobuf or CSV
  * Each document has counts for every word in vocabulary (in CSV format)
  * Pipe mode only supported with recordIO
* **How is it used?**
  * Unsupervised; generates however many topics you specify
  * Optional test channel can be used for scoring results
    * Per-word log likelihood
  * Functionally similar to NTM but CPU-based
    * Therefore maybe cheaper / more efficient  
* **Important Hyperparameters**
  * *Num_topics*
  * *Alpha0*
    * Initial guess for concentration parameter
    * Smaller values generate sparse topic mixtures
    * Larger values (> 1.0) produce uniform mixture
* **Instance Types**
  * Single-instance CPU training

<br>

**K-NEAREST-NEIGHBORS (KNN)**
* **What is it for?**
  * Simple classification or regression algorithm
  * Classification
    * Find the K closest points to a sample point and return the most frequent label
  * Regression
    * Find the K closest points to a sample point and return the average value   
* **What training input does it expect?**
  * Train channel contains your data
  * Test channel emits accuracy or MSE
  * recordIO-protobuf or CSV training
    * First column is label
  * File or Pipe mode on either  
* **How is it used?**
  * Data is first sampled
  * SageMaker includes a dimensionality reduction stage
    * Avoid sparse data ("curse of dimensionality")
    * At cost of noise/accuracy
    * "sign" or "fjlt" methods
  * Build an index for looking up neighbors
  * Serialize the model
  * Query the model for a given K   
* **Important Hyperparameters**
  * *K*
  * *Sample_size*
* **Instance Types**
  * Training on CPU or GPU
    * ml.m5.2xlarge 
    * ml.p2.xlarge
  * Inference
    * CPU for lower latency
    * GPU for higher throughput on large batches
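
The classification and regression rules above fit in a few lines of plain Python. This is a brute-force sketch for intuition only; it skips the sampling, dimensionality reduction and indexing stages the SageMaker version adds, and the data points are invented.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """train: list of (features, label). Return labels of the k closest points."""
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    return [label for _, label in ranked[:k]]

def knn_classify(train, query, k):
    """Most frequent label among the k nearest neighbors."""
    return Counter(k_nearest(train, query, k)).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Average value of the k nearest neighbors."""
    neighbors = k_nearest(train, query, k)
    return sum(neighbors) / len(neighbors)

labeled = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b"), ((1, 0), "a")]
valued = [((0,), 1.0), ((1,), 2.0), ((10,), 30.0)]
print(knn_classify(labeled, (0.5, 0.5), k=3))
print(knn_regress(valued, (0.4,), k=2))
```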

<br>

**K-MEANS CLUSTERING**
* **What is it for?**
  * Unsupervised clustering 
  * Divide data into K groups, where members of a group are as similar as possible to each other
    * You define what "similar" means
    * Measured by Euclidean distance
  * Web-scale K-Means clustering  
* **What training input does it expect?**
  * Train channel, optional test
    * Train ShardedByS3Key, Test FullyReplicated
  * recordIO-protobuf or CSV
  * File or Pipe on either  
* **How is it used?**
  * Every observation mapped to n-dimensional space (n = number of features)
  * Works to optimize the center of K clusters
    * "extra cluster centers" may be specified to improve accuracy (which end up getting reduced to k)
    * K = k * x, where x is the *extra_center_factor*
  * Algorithm:
    * Determine initial cluster centers
      * Random or K-Means++ approach
      * K-Means++ tries to make initial clusters far apart
    * Iterate over training data and calculate cluster centers
    * Reduce clusters from K to k
      * Using Lloyd's method with K-Means++    
* **Important Hyperparameters**
  * K
    * Choosing K is tricky
    * Plot within-cluster sum of squares as function of K
    * Use "Elbow-Method"
    * Basically optimize for tightness of clusters
  * *Mini_batch_size*
  * *Extra_center_factor*
  * *Init_method*   
* **Instance Types**
  * CPU or GPU, but CPU recommended 
    * Only one GPU per instance is used when training on GPU
    * So use p*.xlarge if you're going to use GPU
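
The iterate-and-recompute loop and the within-cluster sum of squares used by the elbow method can be sketched as below. This is plain Lloyd's algorithm with hand-picked starting centers, not SageMaker's web-scale variant with extra centers; the points are a toy example.

```python
import math

def assign(points, centers):
    """Index of the nearest center (Euclidean distance) for each point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def update(points, labels, centers):
    """Move each center to the mean of its members; keep it if it has none."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == c]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(m[d] for m in members) / len(members)
                                 for d in range(len(old))))
    return new_centers

def lloyd(points, centers, iterations=10):
    for _ in range(iterations):
        centers = update(points, assign(points, centers), centers)
    return centers, assign(points, centers)

def wcss(points, centers, labels):
    """Within-cluster sum of squares: the quantity plotted for the elbow method."""
    return sum(math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels))

points = [(0, 0), (0, 2), (10, 0), (10, 2)]
centers1, labels1 = lloyd(points, [(0.0, 0.0)])
centers2, labels2 = lloyd(points, [(0.0, 0.0), (10.0, 0.0)])
print(wcss(points, centers1, labels1), wcss(points, centers2, labels2))
```

Plotting this value for increasing k and looking for the point where it stops dropping sharply is the elbow method in practice.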

<br>

**PRINCIPAL COMPONENT ANALYSIS**
* **What is it for?**
  * Dimensionality reduction
    * Project higher-dimensional data (lots of features) into lower-dimensional (like a 2D plot) while minimizing loss of information
    * The reduced dimensions are called components
      * First component has the largest possible variability
      * Second component has the next largest variability etc
  * Unsupervised    
* **What training input does it expect?**
  * recordIO-protobuf or CSV
  * File or Pipe on either
* **How is it used?**
  * Covariance matrix is created, then singular value decomposition (SVD)
  * Two modes:
    * *Regular*: for sparse data and moderate number of observations and features
    * *Randomized*: for large number of observations and features; uses an approximation algorithm
* **Important Hyperparameters**
  * *Algorithm_mode*
  * *Subtract_mean*
    * Unbias data
* **Instance Types**
  * GPU or CPU
    * It depends "on the specifics of the input data"
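
To see what "first component has the largest possible variability" means, here is a small pure-Python sketch. It centers the data (the *Subtract_mean* idea), builds the covariance matrix, and then uses power iteration as a stand-in for the SVD step to find the first component. The data is a toy example lying exactly on the line y = 2x.

```python
import math

def center(data):
    """Subtract the per-feature mean from every row."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=100):
    """Power iteration: converges to the direction of largest variance.

    Assumes the start vector is not orthogonal to that direction.
    """
    d = len(cov)
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

data = [[1, 2], [2, 4], [3, 6], [4, 8]]
centered = center(data)
component = first_component(covariance(centered))
projected = [sum(x * c for x, c in zip(row, component)) for row in centered]
print(component, projected)
```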

<br>

**FACTORIZATION MACHINES**
* **What is it for?**
  * Dealing with sparse data
    * Click prediction
    * Item recommendations
    * Since an individual user doesn't interact with most pages/products the data is sparse
  * Supervised
    * Classification or regression
  * Limited to pair-wise interactions
    * User -> item for example    
* **What training input does it expect?**
  * recordIO-protobuf with Float32
    * Sparse data means CSV isn't practical: the vast majority of entries have no data, so you would just end up with a huge list of commas
* **How is it used?**
  * Finds factors we can use to predict a classification (click or not? Purchase or not?) or value (predicted rating?) given a matrix representing some pair of things (users & items?)
  * Usually used in the context of recommender systems
* **Important Hyperparameters**
  * Initialization methods for bias, factors and linear terms
    * Uniform, normal or constant
    * Can tune properties of each method
* **Instance Types**
  * CPU or GPU 
    * CPU recommended
    * GPU only works with dense data
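
The bias, linear terms and pair-wise factors mentioned above combine as in the standard second-order factorization machine formula. The sketch below evaluates that textbook form directly; whether SageMaker's implementation matches it term for term is not stated in these notes, and all weights are invented.

```python
def fm_predict(x, w0, w, v):
    """Second-order factorization machine score.

    x: feature values, w0: bias, w: linear weights,
    v: one factor vector per feature (pair-wise interactions only).
    """
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pairwise = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise

# Hypothetical sparse input: features 0 and 2 are active (e.g. one user, one item).
x = [1, 0, 1]
score = fm_predict(x, w0=0.5, w=[0.1, 0.2, 0.3], v=[[1, 0], [0, 1], [1, 1]])
print(score)
```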

<br>

**IP INSIGHTS**
* **What is it for?**
  * Unsupervised learning of IP address usage patterns
  * Identifies suspicious behaviour from IP addresses
    * Identify logins from anomalous IPs
    * Identify accounts creating resources from anomalous IPs
* **What training input does it expect?**
  * User names, account IDs can be fed in directly, no need to pre-process
  * Training channel, optional validation (computes AUC score)
  * CSV only 
    * Entity, IP
* **How is it used?**
  * Uses a neural network to learn latent vector representations of entities and IP addresses
  * Entities are hashed and embedded
    * Need sufficiently large hash size
  * Automatically generates negative samples during training by randomly pairing entities and IPs  
* **Important Hyperparameters**
  * *Num_entity_vectors*
    * Hash size
    * Set to twice the number of unique entity identifiers
  * *Vector_dim*
    * Size of embedding vectors
    * Scales model size
    * Too large results in overfitting
  * Epochs, learning rate, batch size, etc  
* **Instance Types**
  * CPU or GPU
    * GPU recommended
    * ml.p3.2xlarge or higher
    * Can use multiple GPUs
    * Size of CPU instance depends on *vector_dim* and *num_entity_vectors*
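
The two-column CSV input and the "twice the number of unique entities" rule for *num_entity_vectors* can be sketched like this. The entity names are hypothetical and the IP addresses come from ranges reserved for documentation.

```python
import csv
import io

def suggest_num_entity_vectors(rows):
    """Twice the number of unique entity identifiers, per the guidance above."""
    return 2 * len({entity for entity, _ in rows})

def to_csv(rows):
    """Serialize (entity, ip) pairs as headerless CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

rows = [
    ("user_a", "192.0.2.10"),
    ("user_b", "198.51.100.7"),
    ("user_a", "192.0.2.11"),
]
csv_text = to_csv(rows)
num_entity_vectors = suggest_num_entity_vectors(rows)
print(csv_text)
print(num_entity_vectors)
```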

<br>

**REINFORCEMENT LEARNING**
* **What is RL?**
  * You have some sort of agent that "explores" some space
  * As it goes, it learns the value of different state changes in different conditions
  * Those values inform subsequent behaviour of the agent
  * Examples: Pac-Man, Cat & Mouse game (game AI)
    * Supply chain management
    * HVAC systems
    * Industrial robotics
    * Dialog systems
    * Autonomous vehicles
  * Yields fast on-line performance once the space has been explored
* **Q-Learning**
  * A specific implementation of RL
  * You have:
    * A set of environmental states *s*
    * A set of possible actions in those states *a*
    * A value of each state/action *Q*
  * Start off with Q values of 0
  * Explore the space
  * As bad things happen after a given state/action, reduce its *Q*
  * As rewards happen after a given state/action, increase its *Q*
* **The Exploration Problem**
  * How do we efficiently explore all of the possible states?
    * Simple approach: always choose the action for a given state with the highest *Q*. If there's a tie, choose at random
      * But that's really inefficient and you might miss a lot of paths that way
    * Better way: introduce an epsilon term
      * If a random number is less than epsilon, don't follow the highest *Q* but choose at random
      * That way exploration never totally stops
      * Choosing epsilon can be tricky
* **Fancy Words**
  * Markov Decision Process
    * From Wikipedia: **Markov Decision Processes (MDPs)** provide a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker
    * Sound familiar? MDPs are just a way to describe what we just did using mathematical notation
    * States are still described as *s* and *s'*
    * State transition functions are described as *Pa*(*s*,*s'*)
    * Our *Q* values are described as a reward function *Ra*(*s*,*s'*)
  * Even fancier words! An MDP is a *discrete time stochastic control process*
* **Recap**
  * You can make an intelligent Pac-Man in a few steps:
    * Have it semi-randomly explore different choices of movement (actions) given different conditions (states)
    * Keep track of the reward or penalty associated with each choice for a given state/action (*Q*)
    * Use those stored *Q* values to inform its future choices
  * Pretty simple concept. But hey, now you can say you understand RL, Q-Learning, Markov Decision Processes and Dynamic Programming!
* **RL in SageMaker**
  * Uses a DL framework with TensorFlow and MXNet
  * Supports Intel Coach and Ray RLlib toolkits
  * Custom, open-source or commercial environments supported
    * MATLAB, Simulink
    * EnergyPlus, RoboSchool, PyBullet
    * Amazon Sumerian, AWS RoboMaker
* **Distributed Training with SageMaker RL**
  * Can distribute training and/or environment rollout
  * Multi-core and multi-instance
* **RL Key Terms**
  * *Environment*
    * The layout of the board/maze/etc
  * *State*
    * Where the player/pieces are
  * *Action*
    * Move in a given direction, etc
  * *Reward*
    * Value associated with the action from that state
  * *Observation*
    * e.g. surrounding in a maze, state of a chess board
* **Hyperparameter Tuning**
  * Parameters of your choosing may be abstracted
  * Hyperparameter tuning in SageMaker can then optimize them
* **Instance Types**
  * No specific guidance given in developer guide
  * But, it's deep learning - so GPUs are helpful
  * And we know it supports multiple instances and cores
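
The Q-Learning and epsilon ideas from earlier in this section can be tried on a toy problem. This is a minimal tabular sketch, not how SageMaker RL is implemented: the five-cell corridor, the reward of 1 at the goal, and the values of `alpha`, `gamma` and `epsilon` are all assumptions chosen for illustration.

```python
import random

N_STATES = 5      # corridor cells 0..4; cell 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right

def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def choose_action(q, state, epsilon, rng):
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)  # explore: ignore Q and pick at random
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])  # exploit, random tie-break

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # start off with Q values of 0
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose_action(q, state, epsilon, rng)
            nxt, reward, done = step(state, action)
            target = reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

After training, the stored *Q* values should steer the agent right in every cell, which is the "use those stored Q values to inform its future choices" step from the recap.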

<br>

**AUTOMATIC MODEL TUNING**
* **HyperParameter Tuning**
  * How do you know the best values of learning rate, batch size, depth, etc?
  * Often you have to experiment
  * Problem blows up quickly when you have many different hyperparameters; need to try every combination of every possible value somehow, train a model and evaluate it every time
* **Automatic Model Tuning**
  * Define the hyperparameters you care about and the ranges you want to try, and the metrics you are optimizing for
  * SageMaker spins up a "HyperParameter Tuning Job" that trains as many combinations as you'll allow
    * Training instances are spun up as needed, potentially a lot of them 
    * The set of hyperparameters producing the best results can then be deployed as a model
    * **It learns as it goes**, so it doesn't have to try every possible combination
* **Best Practices**
  * Don't optimize too many hyperparameters at once
  * Limit your ranges to as small a range as possible
  * Use logarithmic scales when appropriate
  * Don't run too many training jobs concurrently
    * This limits how well the process can learn as it goes
  * Make sure training jobs running on multiple instances report the correct objective metric in the end
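
Two points above are easy to quantify: how fast an exhaustive grid blows up, and what sampling on a logarithmic scale looks like. The sketch uses plain random sampling, not the learn-as-it-goes strategy SageMaker applies, and the grid values are hypothetical.

```python
import math
import random

def grid_size(grid):
    """Number of training jobs needed to try every combination in the grid."""
    return math.prod(len(values) for values in grid.values())

def sample_log_uniform(low, high, rng):
    """Sample so that each order of magnitude between low and high is equally likely."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "batch_size": [32, 64, 128],
    "max_depth": [3, 4, 5, 6, 7],
}
print(grid_size(grid))  # every added hyperparameter multiplies the job count

rng = random.Random(0)
candidates = [sample_log_uniform(1e-5, 1e-1, rng) for _ in range(5)]
print(candidates)
```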

<br>

**APACHE SPARK**
* **Integrating SageMaker and Spark**
  * Pre-process data as normal with Spark
    * Generate DataFrames
  * Use sagemaker-spark library
  * SageMakerEstimator 
    * KMeans, PCA, XGBoost
  * SageMakerModel
* **The Way It Works**
  * Connect notebook to a remote EMR cluster running Spark (or use Zeppelin)
  * Training dataframe should have:
    * A features column that is vector of Doubles
    * An optional labels column of Doubles
  * Call fit on your SageMakerEstimator to get a SageMakerModel
  * Call transform on the SageMakerModel to make inferences
  * Works with Spark Pipelines as well
* **Why Bother?**
  * Allows you to combine pre-processing big data in Spark with training and inference in SageMaker

<br>

**NEW SAGEMAKER FEATURES**
* **SageMaker Studio**
  * Visual IDE for ML
  * Integrates many of the features we're about to cover
* **SageMaker Notebooks**
  * Create and share Jupyter Notebooks with SageMaker Studio
  * Switch between hardware configurations (no infrastructure to manage)
* **SageMaker Experiments**
  * Organize, capture, compare and search your ML jobs

<br>

**SAGEMAKER DEBUGGER**
* **What is it?**
  * Saves internal model state at periodic intervals
    * Gradients/tensors over time as a model is trained
    * Define rules for detecting unwanted conditions while training 
    * A debug job is run for each rule you configure
    * Logs & fires a CloudWatch event when the rule is hit
  * SageMaker Studio Debugger dashboards
  * Auto-generated training reports
  * Built-in rules:
    * Monitor system bottlenecks
    * Profile model framework operations
    * Debug model parameters
  * Supported Frameworks & Algorithms:
    * TensorFlow
    * PyTorch
    * MXNet
    * XGBoost
    * SageMaker generic estimator (for use with custom training containers)
  * Debugger APIs available in GitHub
    * Construct hooks & rules for CreateTrainingJob and DescribeTrainingJob APIs
    * SMdebug client library lets you register hooks for accessing training data
* **Newer SageMaker Debugger Features**
  * SageMaker Debugger Insights Dashboards
  * Debugger ProfilerRule
    * ProfilerReport
    * Hardware system metrics (CPUBottleneck, GPUMemoryIncrease, etc)
    * Framework Metrics (MaxInitializationTime, OverallFrameworkMetrics, StepOutlier)
  * Built-in actions to receive notifications or stop training
    * StopTraining(), Email() or SMS()
    * In response to Debugger Rules
    * Sends notifications via SNS
  * Profiling system resource usage and training

<br>

**SAGEMAKER AUTOPILOT/AUTOML**
* **SageMaker Autopilot**
  * Automates:
    * Algorithm selection
    * Data preprocessing
    * Model tuning
    * All infrastructure
  * It does all the trial & error for you
  * More broadly this is called AutoML
  * Can add in human guidance
  * With or without code in SageMaker Studio or AWS SDKs
  * Problem Types:
    * Binary classification
    * Multiclass classification
    * Regression
  * Algorithm Types:
    * Linear Learner
    * XGBoost
    * Deep Learning (MLPs)
  * Data must be tabular CSV    
* **SageMaker Autopilot Workflow**
  * Load data from S3 for training
  * Select your target column for prediction
  * Automatic model creation
  * Model notebook is available for visibility & control
  * Model leaderboard
    * Ranked list of recommended models
    * You can pick one
  * Deploy & monitor the model, refine via notebook if needed
* **Autopilot Explainability**
  * Integrates with SageMaker Clarify
  * Transparency on how models arrive at predictions
  * Feature attribution
    * Uses SHAP Baselines/Shapley Values
    * Research from cooperative game theory
    * Assigns each feature an importance value for a given prediction
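
Shapley values can be computed exactly for a tiny model by averaging each feature's marginal contribution over every ordering of the features. The sketch below does that against an all-zero baseline. It is exponential in the number of features, so real tools rely on approximations; the toy model and inputs are invented.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley attribution of predict(x) - predict(baseline) to each feature."""
    n = len(x)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]  # switch feature i from its baseline to its real value
            value = predict(current)
            totals[i] += value - previous
            previous = value
    return [total / len(orders) for total in totals]

def toy_model(features):
    """Hypothetical scoring function with one interaction term."""
    income, debt, age = features
    return 2.0 * income - 1.0 * debt + 0.5 * income * age

x = [3.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]
attributions = shapley_values(toy_model, x, baseline)
print(attributions)
```

The attributions always sum to the difference between the prediction and the baseline prediction, which is what makes them usable as per-prediction importance values.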

<br>

**SAGEMAKER MODEL MONITOR**
* **What is it?**
  * Get alerts on quality deviations on your deployed models (via CloudWatch)
  * Visualize data drift
    * Example: loan model starts giving people more credit due to drifting or missing input features
    * Detect anomalies & outliers
    * Detect new features
    * No code needed
  * Data is stored in S3 and secured
  * Monitoring jobs are scheduled via a Monitoring Schedule
  * Metrics are emitted to CloudWatch
    * CloudWatch notifications can be used to trigger alarms
    * You'd then take corrective action (retrain the model, audit the data)
  * Integrates with TensorBoard, QuickSight and Tableau
    * Or just visualize within SageMaker Studio
  * Monitoring Types:
    * Drift in data quality
      * Relative to a baseline you create
      * "Quality" is just statistical properties of the features
    * Drift in model quality (accuracy, etc)
      * Works the same way with a model quality baseline
      * Can integrate with Ground Truth labels
    * Bias drift
    * Feature attribution drift
      * Based on Normalized Discounted Cumulative Gain (NDCG) score
      * This compares feature ranking of training vs live data    
* **SageMaker Model Monitor + Clarify**
  * Integrates with SageMaker Clarify
    * SageMaker Clarify detects potential bias
    * e.g. imbalances across different groups/ages/income brackets
    * With ModelMonitor you can monitor for bias and be alerted to new potential bias via CloudWatch
    * SageMaker Clarify also helps explain model behaviour
      * Understand which features contribute the most to your predictions 

<br>

**OTHER RECENT FEATURES**
* **SageMaker in 2021**
  * SageMaker Jumpstart
    * One-click models and algorithms from model zoos
    * Over 150 open source models in NLP, object detection, image classification, etc
  * SageMaker Data Wrangler
    * Import/transform/analyze/export data within SageMaker Studio
  * SageMaker Feature Store
    * Find, discover and share features in Studio
    * Online (low latency) or offline (for training or batch inference) modes
    * Features organized into Feature Groups
  * SageMaker Edge Manager
    * Software agent for edge devices
    * Model optimized with SageMaker Neo
    * Collects and samples data for monitoring, labeling and retraining

<br>

**SAGEMAKER CANVAS**
* **What is it?**
  * No-code ML for business analysts
  * Upload csv data (csv only for now), select a column to predict, build it and make predictions
  * Can also join datasets
  * Classification or Regression
  * Automatic data cleaning
    * Missing values
    * Outliers
    * Duplicates
  * Share models & datasets with SageMaker Studio
* **The Finer Points**
  * Local file uploading must be configured "by your IT administrator"
    * Set up an S3 bucket with appropriate CORS permissions
  * Can integrate with Okta SSO (if you want people to be able to sign in)
  * Canvas lives within a SageMaker Domain that must be manually updated
  * Import from Redshift can be set up
  * Time series forecasting must be enabled via IAM
  * Can run within a VPC
  * Pricing is $1.90/hr plus a charge based on number of training cells in a model

<br>

**BIAS MEASURES IN CLARIFY**
* **Pre-Training Bias Metrics in Clarify**
  * Class Imbalance (CI)
    * One facet (demographic group) has fewer training values than another
  * Difference in Proportions of Labels (DPL)
    * Imbalance of positive outcomes between facet values
  * Kullback-Leibler Divergence (KL), Jensen-Shannon Divergence (JS)
    * How much outcome distributions of facets diverge
  * Lp-norm (LP)
    * P-norm difference between distributions of outcomes from facets
  * Total Variation Distance (TVD)
    * L1-norm difference between distributions of outcomes from facets
  * Kolmogorov-Smirnov (KS)
    * Maximum divergence between outcomes in distributions from facets
  * Conditional Demographic Disparity (CDD)
    * Disparity of outcomes between facets as a whole and by subgroups
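
Several of these metrics reduce to short formulas over facet counts and outcome distributions. The sketch below uses the definitions as they are commonly given (facet *a* versus facet *d*); the function names and the sample counts are hypothetical, and the exact normalization Clarify reports should be checked against its documentation.

```python
import math

def class_imbalance(n_a, n_d):
    """CI: normalized difference in facet sizes, in [-1, 1]."""
    return (n_a - n_d) / (n_a + n_d)

def dpl(pos_a, n_a, pos_d, n_d):
    """DPL: difference in the proportion of positive labels between facets."""
    return pos_a / n_a - pos_d / n_d

def kl_divergence(p, q):
    """KL: how much outcome distribution p diverges from q (in nats)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def total_variation_distance(p, q):
    """TVD: half the L1-norm difference between the distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kolmogorov_smirnov(p, q):
    """KS: maximum divergence between outcome probabilities."""
    return max(abs(pi - qi) for pi, qi in zip(p, q))

# Hypothetical data: facet a has 80 rows (40 positive), facet d has 20 rows (5 positive).
n_a, pos_a, n_d, pos_d = 80, 40, 20, 5
p_a = [pos_a / n_a, 1 - pos_a / n_a]  # [P(positive), P(negative)] for facet a
p_d = [pos_d / n_d, 1 - pos_d / n_d]  # same for facet d
print(class_imbalance(n_a, n_d), dpl(pos_a, n_a, pos_d, n_d))
print(kl_divergence(p_a, p_d), total_variation_distance(p_a, p_d), kolmogorov_smirnov(p_a, p_d))
```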

<br>

**SAGEMAKER TRAINING COMPILER**
* **What is it?**
  * Integrated into AWS Deep Learning Containers (DLCs)
    * Can't bring your own container
  * Compile & optimize training jobs on GPU instances
  * Can accelerate training up to 50%
  * Converts models into hardware-optimized instructions
  * Tested with Hugging Face transformers library, or bring your own model
  * Incompatible with SageMaker distributed training libraries
  * Best practices:
    * Ensure GPU instances are used (ml.p3, ml.p4)
    * PyTorch models must use PyTorch/XLA's model save function
    * Enable debug flag in *compiler_config* parameter to enable debugging  




## 5. HIGH-LEVEL ML SERVICES

**AMAZON COMPREHEND**
* **What is it?**
  * Natural Language Processing (NLP) and Text Analytics
  * Input social media, emails, web pages, documents, transcripts, medical records (Comprehend Medical)
  * Extract key phrases, entities, sentiment, language, syntax, topics and document classifications
  * Can train on your own data
  * Some features:
    * *Entities*: extracts the entities in a text (the important objects that exist within it) and categorizes them for you (e.g. "Seattle" -> Location, "Jeff Bezos" -> Person, etc) with a confidence score for each prediction
    * *Key Phrases*: breaking up the sentence into biggest parts (phrases) with confidence score
    * *Language*: looking at a text it says what language it is with confidence score
    * *Sentiment*: (neutral, positive, negative, mixed) a score of confidence is assigned to these sentiments
    * *Syntax*: instead of classifying things on what they are, we're classifying them by the part of speech that they are (proper noun, punctuation, verb, adposition, etc)
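
A minimal sketch of how the sentiment and entity features might be called from Python. It assumes a client that behaves like `boto3.client("comprehend")`. The helper `summarize_text` and the `FakeComprehend` stand-in (with its made-up scores) are hypothetical and exist only so the example runs without AWS credentials.

```python
def summarize_text(client, text, language_code="en"):
    """Return the dominant sentiment and the entities found in `text`.

    `client` is expected to behave like boto3.client("comprehend").
    """
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    entities = client.detect_entities(Text=text, LanguageCode=language_code)
    return {
        "sentiment": sentiment["Sentiment"],
        "sentiment_scores": sentiment["SentimentScore"],
        "entities": [
            (e["Text"], e["Type"], round(e["Score"], 2))
            for e in entities["Entities"]
        ],
    }


class FakeComprehend:
    """Hypothetical stand-in that mimics the response shape for offline runs."""

    def detect_sentiment(self, Text, LanguageCode):
        return {
            "Sentiment": "POSITIVE",
            "SentimentScore": {"Positive": 0.97, "Negative": 0.01,
                               "Neutral": 0.01, "Mixed": 0.01},
        }

    def detect_entities(self, Text, LanguageCode):
        return {"Entities": [
            {"Text": "Seattle", "Type": "LOCATION", "Score": 0.99},
            {"Text": "Jeff Bezos", "Type": "PERSON", "Score": 0.98},
        ]}


result = summarize_text(FakeComprehend(), "Jeff Bezos founded Amazon in Seattle.")
print(result["sentiment"], result["entities"])
```

With real credentials, passing `boto3.client("comprehend")` instead of the fake client should return the same response shape.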

<br>

**AMAZON TRANSLATE**
* **What is it?**
  * Use Deep Learning for translation
  * Supports custom terminology
    * In CSV or TMX format
    * Appropriate for proper names, brand names, etc
  * Detects the source language automatically

<br>

**AMAZON TRANSCRIBE**
* **What is it?**
  * Speech to Text
    * Input in FLAC, MP3, MP4 or WAV in a specific language
    * Streaming audio supported (HTTP/2 or WebSocket)
      * French, English or Spanish only
  * Speaker Identification
    * Specify number of speakers
  * Channel Identification
    * e.g. two callers could be transcribed separately
    * Merging based on timing of "utterances"
  * Automatic Language Identification
    * You don't have to specify a language; it can detect the dominant one spoken
  * Custom Vocabularies
    * Vocabulary Lists (just a list of special words - names, acronyms)
    * Vocabulary Tables (can include "SoundsLike", "IPA" and "DisplayAs")

<br>

**AMAZON POLLY**
* **What is it?**
  * Neural Text-To-Speech, many voices & languages
  * Lexicons
    * Customize pronunciation of specific words & phrases
    * Example: "World Wide Web Consortium" instead of "W3C"
  * SSML
    * Alternative to plain text
    * Speech Synthesis Markup Language
    * Gives control over emphasis, pronunciation, breathing, whispering, speech rate, pitch, pauses
  * Speech Marks
    * Can encode when sentence/word starts and ends in the audio stream
    * Useful for lip-synching animation
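
To make the SSML bullet concrete, here is a small sketch that assembles an SSML document with a pause, emphasis and a slower speech rate. The helper `build_ssml` and the sample sentences are hypothetical. The tags used (`<speak>`, `<break>`, `<emphasis>`, `<prosody>`) are standard SSML, though not every tag is necessarily available for every Polly voice engine.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(intro, emphasized, slow_part, pause_ms=500):
    """Assemble an SSML document from plain-text pieces."""
    return (
        "<speak>"
        f"{escape(intro)}"
        f'<break time="{int(pause_ms)}ms"/>'
        f'<emphasis level="strong">{escape(emphasized)}</emphasis>'
        f'<prosody rate="slow">{escape(slow_part)}</prosody>'
        "</speak>"
    )


ssml = build_ssml(
    "Welcome to the course.",
    "Pay attention here",
    "S3 & Glue come first.",  # the ampersand must be escaped in SSML
)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
print(ssml)
```

The resulting string would be passed to Polly as `Text` with `TextType="ssml"` in a `synthesize_speech` call.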

<br>

**AMAZON REKOGNITION**
* **What is it?**
  * Computer vision
  * Object and scene detection
    * Can use your own face collection
  * Image moderation
  * Facial analysis
  * Celebrity recognition
  * Face comparison
  * Text in image
  * Video analysis
    * Objects/people/celebrities marked on timeline
    * People Pathing
* **The Nitty Gritty**
  * Images come from S3 or provide image bytes as part of request
    * S3 will be faster if the image is already there
  * Facial recognition depends on good lighting, angle, visibility of eyes, resolution
  * Video must come from Kinesis Video Streams
    * H.264 encoded
    * 5-30 FPS
    * Favor resolution over framerate
  * Can use with Lambda to trigger image analysis upon upload
* **Rekognition Custom Labels (2020)**
  * Train with a small set of labeled images
  * Use your own labels for unique items
  * Example: the NFL uses custom labels to identify team logos, pylons and foam fingers in images  
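
The "Lambda on upload" pattern mentioned above can be sketched as follows. This assumes a standard S3 event notification payload and a client that behaves like `boto3.client("rekognition")`. The function `handle_upload`, the `FakeRekognition` stand-in, the bucket name and the returned label are all hypothetical.

```python
import urllib.parse


def handle_upload(event, rekognition, max_labels=10, min_confidence=80.0):
    """Lambda-style handler body: label every image named in an S3 event."""
    results = {}
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = rekognition.detect_labels(
            Image={"S3Object": {"Bucket": bucket, "Name": key}},
            MaxLabels=max_labels,
            MinConfidence=min_confidence,
        )
        results[key] = [label["Name"] for label in response["Labels"]]
    return results


class FakeRekognition:
    """Hypothetical stand-in so the sketch runs without AWS access."""

    def __init__(self):
        self.calls = []

    def detect_labels(self, Image, MaxLabels, MinConfidence):
        self.calls.append(Image)
        return {"Labels": [{"Name": "Dog", "Confidence": 97.1}]}


sample_event = {"Records": [{"s3": {
    "bucket": {"name": "my-bucket"},
    "object": {"key": "photos/my+dog.jpg"},
}}]}
fake = FakeRekognition()
labels = handle_upload(sample_event, fake)
print(labels)
```

Referencing the image by its S3 location avoids sending image bytes in the request, which matches the note that S3 is faster when the image is already there.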

<br>

**AMAZON FORECAST**
* **What is it?**
  * Fully-managed service to deliver highly accurate forecasts with ML
  * "AutoML" chooses best model for your time series data
    * ARIMA, DeepAR, ETS, NPTS, Prophet
  * Works with any time series
    * Price, promotions, economic performance, etc
    * Can combine with associated data to find relationship
  * Inventory planning, financial planning, resource planning
  * Based on "dataset groups", "predictors" and "forecasts"
* **Forecast Algorithms**
  * *CNN-QR*
    * CNN - Quantile Regression
    * Best for large datasets with hundreds of time series
    * Accepts related historical time series data & metadata
    * Very computationally expensive model
  * *DeepAR+*
    * RNN
    * Best for large datasets
    * Accepts related forward-looking time series & metadata
    * Very computationally expensive model
  * *Prophet*
    * Additive model with non-linear trends and seasonality
    * Mid range amount of resources needed
  * *NPTS*
    * Non-Parametric Time Series
    * Good for sparse data. Has variants for seasonal/climatological forecasts
    * Light model
  * *ARIMA*
    * AutoRegressive Integrated Moving Average
    * Commonly used for simple datasets (< 100 time series)
  * *ETS*
    * Exponential Smoothing
    * Commonly used for simple datasets (< 100 time series) 

<br>

**AMAZON LEX**
* **What is it?**
  * Billed as the inner workings of Alexa
  * Natural-Language chatbot engine
  * A bot is built around Intents
    * Utterances invoke intents ("I want to order a pizza")
    * Lambda functions are invoked to fulfill the intent
    * Slots specify extra information needed by the intent
      * Pizza size, toppings, crust type, when to deliver, etc
  * Can deploy to AWS Mobile SDK, Facebook Messenger, Slack and Twilio
* **Amazon Lex Automated Chatbot Designer**
  * You provide existing conversation transcripts
  * Lex applies NLP & DL, removing overlaps and ambiguity
  * Intents, user requests, phrases, values for slots are extracted
  * Ensures intents are well defined and separated
  * Integrates with Amazon Connect transcripts
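
The relationship between utterances, intents, slots and fulfillment can be modeled in a few lines. This is a conceptual sketch only, not the Lex API: the `Intent` class, the substring matching and the `fulfill` function (standing in for the Lambda function) are all hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    name: str
    utterances: list       # sample phrases that invoke the intent
    required_slots: list   # extra information needed before fulfillment
    slots: dict = field(default_factory=dict)

    def matches(self, text):
        """Naive matching: does the text contain a sample utterance?"""
        return any(u.lower() in text.lower() for u in self.utterances)

    def next_missing_slot(self):
        """Return the next slot the bot still has to ask for, or None."""
        for slot in self.required_slots:
            if slot not in self.slots:
                return slot
        return None


def fulfill(intent):
    """Stand-in for the Lambda function that would fulfill the intent."""
    return f"Ordering a {intent.slots['size']} pizza with {intent.slots['toppings']}"


order = Intent("OrderPizza", ["order a pizza"], ["size", "toppings"])
if order.matches("I want to order a pizza"):
    answers = {"size": "large", "toppings": "mushrooms"}  # simulated user replies
    while (slot := order.next_missing_slot()) is not None:
        order.slots[slot] = answers[slot]
    confirmation = fulfill(order)
    print(confirmation)
```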

<br>

**AMAZON PERSONALIZE**
* **What is it?**
  * Fully-managed recommender engine
    * Same one Amazon uses
  * API access
    * Feed in data (purchases, ratings, impressions, cart adds, catalog, user demographics, etc) via S3 or API integration
    * You provide an explicit schema in Avro format 
    * Javascript or SDK
    * GetRecommendations
      * Recommended products, content, etc
      * Similar items
    * GetPersonalizedRanking
      * Rank a list of items provided
      * Allows editorial control/curation (if you want to push specific products)  
  * Console and CLI too
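
The explicit Avro schema mentioned above might look like the following for an Interactions dataset. This is a sketch based on the minimal fields I understand Personalize to expect (`USER_ID`, `ITEM_ID`, `TIMESTAMP`); real datasets usually add more fields.

```python
import json

# Minimal Interactions schema in Avro (JSON) form.
interactions_schema = {
    "type": "record",
    "name": "Interactions",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        {"name": "USER_ID", "type": "string"},
        {"name": "ITEM_ID", "type": "string"},
        {"name": "TIMESTAMP", "type": "long"},
    ],
    "version": "1.0",
}

# The API takes the schema as a JSON string rather than a dict.
schema_json = json.dumps(interactions_schema, indent=2)
print(schema_json)
```

The JSON string would then be passed as the `schema` argument of the Personalize `create_schema` call.
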
* **Amazon Personalize Features**
  * Real-time or batch recommendations
  * Recommendations for new users and new items (the cold start problem)
  * Contextual recommendations
    * Device type, time, etc
  * Similar items
  * Unstructured text input
  * Intelligent user segmentation
    * For marketing campaigns
* **Amazon Personalize Terminology**
  * Datasets
    * Users, Items, Interactions
  * Recipes
    * USER_PERSONALIZATION
    * PERSONALIZED_RANKING
    * RELATED_ITEMS
  * Solutions
    * Trains the model
    * Optimizes for relevance as well as your additional objectives
      * Video length, price, etc - must be numeric
    * Hyperparameter Optimization (HPO - automatic optimization)
  * Campaigns
    * Deploys your "solution version"
    * Deploys capacity for generating real-time recommendations
* **Amazon Personalize Hyperparameters**
  * User-Personalization, Personalized-Ranking
    * *hidden_dimension* - (HPO)
    * *bptt* - (back-propagation through time - RNN) 
      * the older an event is, the less it counts; this RNN gives more weight to recent events
    * *recency_mask* - (weights recent events)
    * *min_max_user_history_length_percentile* - (filters out robots: ignores users who interacted with very few products (1 or 2) or with a lot of products (2-300))
    * *exploration_weight* - 0-1, controls the relevance of your results
    * *exploration_item_age_cut_off* - how far back in time you go while you're doing that exploration
  * Similar-items
    * *item_id_hidden_dimension* (HPO)
    * *item_metadata_hidden_dimension* (HPO with min & max range specified) 
* **Maintaining Relevance**
  * Keep your datasets current
    * Incremental data import
  * Use PutEvents operation to feed in real-time user behaviour
  * Retrain the model
    * They call this a new *solution version*
    * Updates every 2 hours by default
    * Should do a full retrain (trainingMode = FULL) weekly
* **Amazon Personalize Security**
  * Data not shared across accounts
  * Data may be encrypted with KMS
  * Data may be encrypted at rest in your region (SSE-S3)
  * Data in transit between your account and Amazon's internal systems encrypted with TLS 1.2
  * Access control via IAM
  * Data in S3 must have appropriate bucket policy for Amazon Personalize to process it
  * Monitoring & logging via CloudWatch and CloudTrail
* **Amazon Personalize Pricing**
  * Data ingestion: per GB
  * Training: per training-hour
  * Inference: per TPS-hour (Transactions Per Second)
  * Batch recommendations: per user or per item

<br>

**OTHER ML SERVICES**
* **Amazon Textract**
  * OCR (Optical Character Recognition) with forms, fields and table support
  * It's a technology that recognizes text within a digital image
* **Amazon DeepRacer**
  * RL powered 1/18-scale race car
* **Amazon DeepLens**
  * Deep Learning-enabled video camera
  * Integrated with Rekognition, SageMaker, Polly, Tensorflow, MXNet, Caffe
  * It's good for prototyping new ideas
* **Amazon Lookout**
  * Industrial Application
  * Equipment, metrics, vision
  * Detects abnormalities from sensor data automatically to detect equipment issues
  * Monitors metrics from S3, RDS, Redshift, 3rd party SaaS apps
  * Vision uses computer vision to detect defects in silicon wafers, circuit boards, etc
* **Amazon Monitron**
  * Industrial Application
  * End to end system for monitoring industrial equipment and predictive maintenance
* **TorchServe**
  * Model serving framework for PyTorch
  * Part of the PyTorch open source project from Meta (FB)
* **AWS Neuron**
  * SDK for ML inference specifically on AWS Inferentia chips
  * EC2 Inf1 instance type
  * Integrated with SageMaker or whatever else you want (deep learning AMIs, containers, TensorFlow, PyTorch, MXNet)  
* **Amazon Panorama**
  * Computer vision at the edge
  * Like DeepLens but more general
  * Brings computer vision to your existing IP cameras
* **Amazon DeepComposer**
  * AI-powered keyboard
  * Composes a melody into an entire song
  * For educational purposes
* **Amazon Fraud Detector**
  * Upload your own historical fraud data
  * Builds custom models from a template
  * Exposes an API for your online application
  * Assess risk from:
    * New accounts
    * Guest checkout
    * "Try before you buy" abuse
    * Online payments
* **Amazon CodeGuru**
  * Automated code reviews!
  * Finds lines of code that hurt performance
  * Resource leaks, race conditions
  * Offers specific recommendations
  * Powered by ML
  * Supports Java and Python
* **Contact Lens for Amazon Connect**
  * For customer support call centers
  * Ingests audio data from recorded calls
  * Allows search on calls/chats
  * Sentiment Analysis
  * Find "utterances" that correlate with successful calls
  * Categorize calls automatically
  * Measure talk speed and interruptions
  * Theme detection: discover emerging issues
* **Amazon Kendra**
  * Enterprise search with Natural Language
  * For example: "Where is the IT support desk?", "How do I connect to my VPN?"
  * Combines data from file systems, SharePoint, intranet, sharing services (JDBC, S3) into one searchable repository
  * ML-powered - uses thumbs up/down feedback
  * Relevance tuning - boost strength of document freshness, view counts, etc
  * Alexa's sister? Maybe that's one way to remember it
* **Amazon Augmented AI (A2I)**
  * Human review of ML predictions
  * Builds workflows for reviewing low-confidence predictions
  * Access the Mechanical Turk workforce or vendors
  * Integrated into Amazon Textract and Rekognition
  * Integrates with SageMaker
  * Very similar to Ground Truth

<br> 

**PUTTING THE BLOCKS TOGETHER**
* Build your own Alexa!
  * Transcribe -> Lex -> Polly
* Make a universal translator
  * Transcribe -> Translate -> Polly
* Build a Jeff Bezos detector!
  * DeepLens -> Rekognition
* Are people on the phone happy?
  * Transcribe -> Comprehend                         

                                           

  

## 6. ML IMPLEMENTATIONS & OPERATIONS

**SAGEMAKER & DOCKER CONTAINERS**
* **SageMaker + Docker**
  * All models in SageMaker are hosted in Docker containers that are registered with ECR
    * Pre-built Deep Learning
    * Pre-built scikit-learn and Spark ML
    * Pre-built Tensorflow, MXNet, Chainer, PyTorch
      * Distributed training via **Horovod** or **Parameter Servers**
    * Your own training and inference code! Or extend a pre-built image
  * This allows you to use any script or algorithm within SageMaker, regardless of runtime or language
    * Containers are isolated and contain all dependencies and resources needed to run
* **Using Docker**
  * Docker containers are created from *images*
  * Images are built from a *Dockerfile*
  * Images are saved in a *repository*
    * Amazon Elastic Container Registry (ECR)
* **Structure of a Training Container**
  * /opt/ml
    * input
      * config
        * hyperparameters.json
        * resourceConfig.json
      * data
        * \<channel_name>
        * \<input_data>
    * model
    * code
      * \<script_files>
    * output
      * failure
* **Structure of your Docker Image**
  * WORKDIR
    * **nginx.conf**: configuration file for the Nginx front end, i.e. the web server that runs inside the container
    * **predictor.py**: the program that implements a Flask web server for making predictions at runtime
    * **serve/**: the program in here is started when the container is started for hosting. It simply launches the Gunicorn server, which runs multiple instances of the Flask application defined in *predictor.py*
    * **train/**: contains the program that is invoked when you run the container for training. To implement your own training algorithm, modify the program that lives here
    * **wsgi.py**: a small wrapper used to invoke your Flask application for serving results
* **Assembling it all in a Dockerfile**
  * *FROM tensorflow/tensorflow:2.0.0a0*
  * *RUN pip install sagemaker-containers*
  * *COPY train.py /opt/ml/code/train.py*
    * \# Copies the training code inside the container
  * *ENV SAGEMAKER_PROGRAM train.py*
    * \# Defines train.py as script entrypoint
* **Environment Variables**
  * SAGEMAKER_PROGRAM
    * Run a script inside /opt/ml/code
  * SAGEMAKER_TRAINING_MODULE
  * SAGEMAKER_SERVICE_MODULE
  * SM_MODEL_DIR
  * SM_CHANNELS / SM_CHANNEL_*
  * SM_HPS / SM_HP_*
  * SM_USER_ARGS
  * ...and many more
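
A training script typically reads these variables at start-up. The following is a minimal sketch, assuming `SM_HPS` holds a JSON object of hyperparameters and `SM_CHANNELS` a JSON list of channel names, which is how I understand the SageMaker training toolkit sets them. The helper `read_training_env` and the values in `fake_env` are hypothetical.

```python
import json
import os


def read_training_env(env=os.environ):
    """Collect model dir, hyperparameters and channel dirs from the environment."""
    hyperparameters = json.loads(env.get("SM_HPS", "{}"))
    channels = json.loads(env.get("SM_CHANNELS", "[]"))
    return {
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        "hyperparameters": hyperparameters,
        "channel_dirs": {
            # Fall back to the conventional data path if the variable is unset.
            name: env.get(f"SM_CHANNEL_{name.upper()}", f"/opt/ml/input/data/{name}")
            for name in channels
        },
    }


# Simulated environment, so the sketch also runs outside a container.
fake_env = {
    "SM_MODEL_DIR": "/opt/ml/model",
    "SM_HPS": '{"epochs": 5, "learning_rate": 0.01}',
    "SM_CHANNELS": '["training", "validation"]',
    "SM_CHANNEL_TRAINING": "/opt/ml/input/data/training",
}
config = read_training_env(fake_env)
print(config)
```

Inside a real training container, calling `read_training_env()` with no argument would read the actual environment.
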
* **Using your own Image**
  * *cd dockerfile*
  * *!docker build -t foo .*
  * *from sagemaker.estimator import Estimator*
    * *estimator = Estimator(image_name='foo', role='SageMakerRole', train_instance_count=1, train_instance_type='local')*
    * *estimator.fit()*
* **Production Variants**
  * You can test out multiple models on live traffic using Production Variants
    * Variant Weights tell SageMaker how to distribute traffic among them
    * So, you could roll out a new iteration of your model at say 10% variant weight
    * Once you're confident in its performance, ramp it to 100%
  * This lets you run A/B tests and validate performance in real-world settings
    * Offline validation isn't always useful (e.g. Recommender Systems, where accuracy on people's past behaviour isn't always a good indicator of their performance on future or unseen behaviour)
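
The effect of variant weights is simple arithmetic: each variant receives its weight divided by the sum of all weights. A minimal sketch, using the field names of the `ProductionVariants` list in an endpoint configuration; the model names, instance types and the helper `traffic_shares` are hypothetical.

```python
def traffic_shares(variants):
    """Return the fraction of traffic each production variant receives."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


# Roll out a new model iteration at 10% of live traffic.
production_variants = [
    {
        "VariantName": "current",
        "ModelName": "model-v1",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 9.0,
    },
    {
        "VariantName": "candidate",
        "ModelName": "model-v2",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
        "InitialVariantWeight": 1.0,
    },
]
shares = traffic_shares(production_variants)
print(shares)
```

Because only the ratio matters, weights of 9 and 1 behave the same as 0.9 and 0.1.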

<br>

**SAGEMAKER ON THE EDGE**
* **SageMaker Neo**
  * Train once, run anywhere
    * Edge devices
      * ARM, Intel, Nvidia processors
      * Embedded in whatever - your car?
  * Optimizes code for specific devices
    * Tensorflow, MXNet, PyTorch, ONNX, XGBoost
  * Consists of a compiler and a runtime
    * **compiler**
      * re-compiles that code into the byte code expected by those edge processors
    * **runtime**
      * runs on those edge devices to consume that Neo generated code
* **Neo + AWS IoT Greengrass**
  * Neo-compiled models can be deployed to an HTTPS endpoint
    * Hosted on C5, M5, M4, P3 or P2 instances
    * Must be same instance type used for compilation
  * OR! you can deploy to IoT Greengrass
    * This is how you get the model to an actual edge device
    * Inference at the edge with local data, using model trained in the cloud
    * Uses Lambda inference applications
* **Recap**
  * **Neo**: compiles your trained model for specific architectures, so that it can be deployed to
    * **Edge Devices**
    * **IoT Greengrass**

<br>

**SAGEMAKER SECURITY**
* **General AWS Security**
  * Use Identity and Access Management (IAM)
    * Set up user accounts with only the permissions they need
  * Use MFA
  * Use SSL/TLS when connecting to anything
  * Use CloudTrail to log API and user activity
  * Use encryption
  * Be careful with PII (Personally Identifiable Information)
* **Protecting Your Data at Rest in SageMaker**
  * AWS Key Management Service (KMS)
    * Accepted by notebooks and all SageMaker jobs
      * Training, tuning, batch transform, endpoints
      * Notebooks and everything under */opt/ml* and */tmp* can be encrypted with a KMS key
  * S3
    * Can use encrypted S3 buckets for training data and hosting models
    * S3 can also use KMS
* **Protecting Data in Transit in SageMaker**
  * All traffic supports TLS/SSL
  * IAM roles are assigned to SageMaker to give it permissions to access resources
  * Inter-node training communication may be optionally encrypted
    * Can increase training time and cost with deep learning
    * AKA inter-container traffic encryption
    * Enabled via console or API when setting up a training or tuning job
* **SageMaker + VPC**
  * Training jobs run in a Virtual Private Cloud (VPC)
  * You can use a private VPC for even more security
    * You'll need to set up S3 VPC endpoints
    * Custom endpoint policies and S3 bucket policies can keep this secure
  * Notebooks are Internet-enabled by default
    * This can be a security hole
    * If disabled, your VPC needs an interface endpoint (PrivateLink) or NAT Gateway, and must allow outbound connections, for training and hosting to work
  * Training and Inference Containers are also Internet-enabled by default
    * Network isolation is an option but this also prevents S3 access
* **SageMaker + IAM**
  * User permissions for:
    * CreateTrainingJob
    * CreateModel
    * CreateEndpointConfig
    * CreateTransformJob
    * CreateHyperParameterTuningJob
    * CreateNotebookInstance
    * UpdateNotebookInstance
  * Predefined policies:
    * AmazonSageMakerReadOnly
    * AmazonSageMakerFullAccess
    * AdministratorAccess
    * DataScientist
* **SageMaker Logging and Monitoring**
  * CloudWatch can log, monitor and alarm on:
    * Invocations and latency of endpoints
    * Health of instance nodes (CPU, memory, etc)
    * Ground Truth (active workers, how much they are doing)
  * CloudTrail records actions from users, roles and services within SageMaker
    * Log files delivered to S3 for auditing

<br>

**SAGEMAKER RESOURCES MANAGEMENT**
* **Choosing Your Instance Types**
  * We covered this under "modeling" even though it's an operations concern
  * In general, algorithms that rely on deep learning will benefit from GPU instances (P2 or P3) for training
  * Inference is usually less demanding and you can often get away with compute instances there (C4, C5)
  * GPU instances can be really pricey
* **Managed Spot Training**
  * Can use EC2 Spot Instances for training
    * Save up to 90% over on-demand instances
  * Spot instances can be interrupted!
    * Use checkpoints to S3 so training can resume
  * Can increase training time as you need to wait for spot instances to become available 
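
The checkpoint-and-resume idea can be sketched with a toy training loop. This assumes the training code writes checkpoints to a local directory that SageMaker syncs to S3 (commonly `/opt/ml/checkpoints`, as far as I know). The `train` function, its `interrupt_at` parameter and the JSON checkpoint format are hypothetical; a temporary directory stands in for the synced folder.

```python
import json
import os
import tempfile


def train(total_epochs, checkpoint_dir, interrupt_at=None):
    """Toy training loop that checkpoints after every epoch and can resume."""
    path = os.path.join(checkpoint_dir, "checkpoint.json")
    state = {"epoch": 0, "weight": 0.0}
    if os.path.exists(path):
        # Resume from the last completed epoch instead of starting over.
        with open(path) as f:
            state = json.load(f)
    for epoch in range(state["epoch"], total_epochs):
        if interrupt_at is not None and epoch == interrupt_at:
            return state, False  # simulate a spot interruption
        state = {"epoch": epoch + 1, "weight": state["weight"] + 0.5}
        with open(path, "w") as f:
            json.dump(state, f)
    return state, True


with tempfile.TemporaryDirectory() as ckpt_dir:
    first_state, first_done = train(5, ckpt_dir, interrupt_at=3)
    final_state, final_done = train(5, ckpt_dir)  # restarted job resumes at epoch 3

print(first_state, first_done)
print(final_state, final_done)
```
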
* **Elastic Inference**
  * Accelerates deep learning inference
    * At fraction of cost of using a GPU instance for inference
  * EI accelerators may be added alongside a CPU instance
    * ml.eia1.medium/large/xlarge
  * EI accelerators may also be applied to notebooks
  * Works with Tensorflow, PyTorch and MXNet pre-built containers
    * ONNX may be used to export models to MXNet
  * Works with custom containers built with EI-enabled Tensorflow, PyTorch or MXNet
  * Works with Image Classification and Object Detection built-in algorithms
* **Automatic Scaling**
  * You set up a scaling policy to define target metrics, min/max capacity and cooldown periods
  * Works with CloudWatch
  * Dynamically adjusts number of instances for a production variant
  * Load test your configuration before using it!
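
A scaling policy for a production variant is defined through Application Auto Scaling. The sketch below only builds the two request payloads (scalable target and target-tracking policy) without calling AWS. The helper `build_scaling_requests`, the endpoint name, the policy name and the numeric values are hypothetical.

```python
def build_scaling_requests(endpoint_name, variant_name, min_capacity, max_capacity,
                           target_invocations, scale_in_cooldown=300,
                           scale_out_cooldown=60):
    """Build request arguments for a target-tracking policy on one variant."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity must not exceed max_capacity")
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    # Min/max capacity bound the number of instances behind the variant.
    register = dict(common, MinCapacity=min_capacity, MaxCapacity=max_capacity)
    policy = dict(
        common,
        PolicyName=f"{variant_name}-target-tracking",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Target metric: invocations per instance.
            "TargetValue": float(target_invocations),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance",
            },
            # Cooldown periods, in seconds.
            "ScaleInCooldown": scale_in_cooldown,
            "ScaleOutCooldown": scale_out_cooldown,
        },
    )
    return register, policy


register_args, policy_args = build_scaling_requests(
    "my-endpoint", "AllTraffic", min_capacity=1, max_capacity=4,
    target_invocations=1000,
)
print(register_args)
print(policy_args)
```

With a `boto3.client("application-autoscaling")` client, these dictionaries would be passed to `register_scalable_target(**register_args)` and `put_scaling_policy(**policy_args)`.
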
* **SageMaker and Availability Zones**
  * SageMaker automatically attempts to distribute instances across availability zones
  * But you need more than one instance for this to work!
  * Deploy multiple instances for each production endpoint
  * Configure VPCs with at least two subnets, each in a different AZ

<br>

**SAGEMAKER SERVERLESS INFERENCE**
* **Serverless Inference**
  * Introduced in 2022
  * Specify your container, memory requirement, concurrency requirements
  * Underlying capacity is automatically provisioned and scaled
  * Good for infrequent or unpredictable traffic; will scale down to zero when there are no requests
  * Charged based on usage
  * Monitor via CloudWatch
    * ModelSetupTime, Invocations, MemoryUtilization
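
The memory and concurrency requirements are expressed as a `ServerlessConfig` on the production variant. A minimal sketch that builds and validates such a variant; the helper `serverless_variant`, the model name and the list of allowed memory sizes (1 GB steps up to 6 GB, as I understand the current limits) are assumptions.

```python
# Assumed set of valid memory sizes; check the current service limits.
ALLOWED_MEMORY_MB = (1024, 2048, 3072, 4096, 5120, 6144)


def serverless_variant(model_name, memory_mb, max_concurrency,
                       variant_name="AllTraffic"):
    """Build a production variant that uses serverless capacity."""
    if memory_mb not in ALLOWED_MEMORY_MB:
        raise ValueError(f"memory_mb must be one of {ALLOWED_MEMORY_MB}")
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    return {
        "VariantName": variant_name,
        "ModelName": model_name,
        # No instance type or count: capacity is provisioned automatically.
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


variant = serverless_variant("my-model", memory_mb=2048, max_concurrency=5)
print(variant)
```
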
* **AWS SageMaker Inference Recommender**
  * Recommends best instance type & configuration for your models
  * Automates load testing and model tuning
  * Deploys to optimal inference endpoint
  * How it works:
    * Register your model to the model registry
    * Benchmark different endpoint configurations
    * Collect & visualize metrics to decide on instance types
    * Existing models from zoos may have benchmarks already
  * Instance Recommendations
    * Run load tests on recommended instance types
    * Takes about 45 minutes
  * Endpoint Recommendations
    * Custom load test
    * You specify instances, traffic patterns, latency requirements, throughput requirements
    * Takes about 2 hours

<br>

**SAGEMAKER INFERENCE PIPELINES**
* **Inference Pipelines**
  * Linear sequence of 2-15 containers
  * Any combination of pre-trained built-in algorithms or your own algorithms in Docker containers
  * Combine pre-processing, predictions, post-processing
  * Spark ML and scikit-learn containers OK
    * Spark ML can be run with Glue or EMR
    * Serialized into MLeap format
  * Can handle both real-time inference and batch transforms
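
Conceptually, an inference pipeline passes each container's output to the next one in sequence. The sketch below models that with plain Python functions standing in for containers. It is not the SageMaker API; `make_pipeline` and the three toy stages are hypothetical.

```python
def make_pipeline(*containers):
    """Chain 2-15 callables so each one consumes the previous one's output."""
    if not 2 <= len(containers) <= 15:
        raise ValueError("an inference pipeline needs 2-15 containers")

    def invoke(payload):
        for container in containers:
            payload = container(payload)
        return payload

    return invoke


def preprocess(record):
    """Parse a CSV row into numeric features."""
    return [float(x) for x in record.split(",")]


def predict(features):
    """Hypothetical 'model': the mean of the features."""
    return sum(features) / len(features)


def postprocess(score):
    """Turn the raw score into a labeled response."""
    return {"label": "high" if score > 0.5 else "low", "score": score}


pipeline = make_pipeline(preprocess, predict, postprocess)
prediction = pipeline("0.2,0.9,0.7")
print(prediction)
```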
