#### Copyright 2017 Google LLC.

In [0]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Welcome: Labs for Introduction to Machine Learning

## Creating Your Copy for Each Lab

* Start each lab by saving a copy, which you can do by going to the `File` menu and selecting "`Save a copy in Drive`"
* In Drive, modify "`Copy of ...`" to whatever name you'd like for your notebook
* Colab notebooks can be shared just as you would share Google Docs or Sheets. Simply click the `Share` button at the top right of your notebook, or follow these [Google Drive file sharing instructions](https://support.google.com/drive/answer/2494822?co=GENIE.Platform%3DDesktop&hl=en).

## Some Basics of Using Colab

* Since Colab runs in the Cloud, you do not need to load any libraries on your computer.
* When doing these labs, you'll generally run one cell at a time, looking at and thinking about the results before moving to the next cell.
* Remember that whenever you modify a cell, you will need to run that cell again for the change to take effect.
* It is common to have code blocks that do not generate any output (e.g., imports or definitions of procedures). These generally run quickly, and you'll know code execution is complete when the "arrow" to the left of the cell is shown again. When you select the next cell, you will see a number showing the order in which the code blocks were executed. If you want output when a code block is run, you can always place a print statement at the end of the block.
* The state of a notebook is global, so even if you remove the definition of a variable after its cell has been run, the variable will still be defined. Go to the `Runtime` menu and select "`Restart runtime...`" if you want to reset the runtime. After doing this you will need to re-run the cells. If you have made some changes to fix a bug, you might want to do this to be sure you are starting from a fresh state.
* Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system. You will need to rerun all cells when your virtual machine is recycled.
* Within the `Runtime` menu you can use "`Run all`", "`Run before`", or "`Run after`" to run multiple cells at once.
* Use the `+ CODE` button between cells (or at the top) to add a cell for code.
* Use the `+ TEXT` button between cells (or at the top) to add a cell for text.
* To edit a text cell, double-click on it and then edit using the [mark-up language](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/markdown_guide.ipynb).

## Labs that use the UCI Automobile Data Set for Predicting Real-Valued Labels (Regression)

We start with the UCI Automobile data set since it is an easy-to-understand data set that has missing data and both numerical and categorical data.

We begin by using Pandas to load and explore the raw data. Next, we put together all the pieces needed to train a linear regression model in TensorFlow and to visualize the results and learning curve. You will begin to explore setting the learning rate and number of steps in this first lab.

Next, you will learn how to train a model with multiple features, including some key feature engineering techniques such as feature normalization, bucketizing real-valued features, and feature crosses. Finally, you will incorporate categorical features with the goal of training the best model you can to predict the city mpg for a car based on the other available features.


###[Lab 1: Loading and Understanding Your Data](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_1__Loading_and_Understanding_Your_Data.ipynb)

**Learning Objectives:**

* Learn the basics of reading data with Pandas
* Learn the basics of data cleaning and handling missing data using Pandas
* Learn how to visualize data with a scatter plot
* Use Numpy to generate the line minimizing squared loss
* Explore visually the difference in the model when replacing missing items by 0s versus the mean value for that feature
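
As a rough preview of the last three objectives, the sketch below fills a missing value with 0 and with the column mean, then fits a least-squares line to each version with `np.polyfit`. The toy table and its column names (`horsepower`, `city_mpg`) are made up for illustration and are not the lab's actual data or code.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values and column names are illustrative only.
cars = pd.DataFrame({
    "horsepower": [100.0, np.nan, 150.0, 200.0],
    "city_mpg": [30.0, 28.0, 22.0, 16.0],
})

# Two ways to fill the missing horsepower value.
filled_with_zero = cars["horsepower"].fillna(0.0)
filled_with_mean = cars["horsepower"].fillna(cars["horsepower"].mean())


def fit_line(x, y):
    """Return (slope, intercept) of the line minimizing squared loss."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


mpg = cars["city_mpg"].to_numpy()
slope_zero, intercept_zero = fit_line(filled_with_zero.to_numpy(), mpg)
slope_mean, intercept_mean = fit_line(filled_with_mean.to_numpy(), mpg)

print("zero fill:", slope_zero, intercept_zero)
print("mean fill:", slope_mean, intercept_mean)
```

Filling with 0 places an artificial point far from the rest of the data, which pulls the fitted line toward it; filling with the mean keeps the point inside the observed range.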

###[Lab 2: Training Your First TF Linear Regression Model](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_2__Training_Your_First_TF_Linear_Regression_Model.ipynb)

**Learning Objectives:**

* Use pyplot to help visualize the data, the learned model, and how the loss is evolving during training
* Learn how to set up the features in TensorFlow to train a model.
* Use the LinearRegressor class in TensorFlow to predict a real-valued label based on one real-valued input feature
* Visualize the resulting model using pyplot
* Evaluate the accuracy of a model's predictions using Root Mean Squared Error (RMSE)
* Improve the accuracy of a model by tuning the learning rate and number of training steps
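
To make the roles of the learning rate and the number of steps concrete, here is a minimal NumPy sketch of full-batch gradient descent on a one-feature linear model, scored with RMSE. It is a stand-in for illustration only: it does not use TensorFlow's LinearRegressor, and the function names and toy data are assumptions.

```python
import numpy as np


def rmse(predictions, targets):
    """Root Mean Squared Error between predictions and targets."""
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


def train_linear_model(x, y, learning_rate, steps):
    """Fit y ~ weight * x + bias by gradient descent on mean squared error."""
    weight, bias = 0.0, 0.0
    losses = []
    for _ in range(steps):
        error = weight * x + bias - y
        # Gradients of the mean squared error with respect to weight and bias.
        weight -= learning_rate * 2.0 * np.mean(error * x)
        bias -= learning_rate * 2.0 * np.mean(error)
        losses.append(rmse(weight * x + bias, y))
    return weight, bias, losses


# Toy data generated from a known line, so the target parameters are 2 and 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

weight, bias, losses = train_linear_model(x, y, learning_rate=0.05, steps=2000)
print("weight:", weight, "bias:", bias, "final RMSE:", losses[-1])
```

Plotting `losses` against the step number gives a learning curve; rerunning with a much smaller learning rate or fewer steps shows how either choice can leave the model short of the best fit.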

###[Lab 3: Using Multiple Numerical Features and Feature Scaling](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_3__Using_Multiple_Numerical_Features_and_Feature_Scaling.ipynb)

**Learning Objectives:**
* Train a model using more than one feature
* Learn the importance of feature transformations
* Introduce linear and log transformations of features
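
The two transformations named above can be sketched in a few lines of NumPy. The helper names and the sample prices are hypothetical; the lab may define its scaling differently.

```python
import numpy as np


def linear_scale(values):
    """Linearly map values onto the range [0, 1]."""
    low, high = values.min(), values.max()
    return (values - low) / (high - low)


def log_scale(values):
    """Apply log(1 + x), which compresses large values and is defined at 0."""
    return np.log1p(values)


# Made-up prices with one large value.
price = np.array([5000.0, 10000.0, 20000.0, 45000.0])

scaled = linear_scale(price)
logged = log_scale(price)
print(scaled)
print(logged)
```

Linear scaling keeps the relative spacing of the values, while the log transformation narrows the gap between the largest value and the rest.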

###[Lab 4: Using Bucketized Numerical Features](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_4__Using_a_Bucketized_Numerical_Feature.ipynb)

**Learning Objectives:**
* Create bucketized numerical features in TF and use them to train a model
* Use visualizations to understand the value of using bucketized features
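
Bucketizing replaces a numerical value with the index of the range it falls into, usually followed by a one-hot encoding so that a linear model can learn a separate weight per bucket. The sketch below shows the idea in plain NumPy rather than with TensorFlow feature columns; the boundaries and the `bucketize` helper are illustrative assumptions.

```python
import numpy as np


def bucketize(values, boundaries):
    """Return the bucket index of each value and its one-hot encoding."""
    # np.digitize returns i such that boundaries[i-1] <= value < boundaries[i].
    bucket_index = np.digitize(values, boundaries)
    one_hot = np.eye(len(boundaries) + 1)[bucket_index]
    return bucket_index, one_hot


# Hypothetical horsepower values and hand-picked boundaries.
horsepower = np.array([60.0, 95.0, 120.0, 210.0])
boundaries = [80.0, 150.0]

bucket_index, one_hot = bucketize(horsepower, boundaries)
print(bucket_index)
print(one_hot)
```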

###[Lab 5: Using Categorical Features](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_5__Using_Categorical_Features.ipynb)

**Learning Objectives:**

* Use numerical and categorical features in TF to train a model
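
One common way to feed a categorical feature to a linear model is to one-hot encode it against a fixed vocabulary and place it alongside the numerical columns. The following NumPy sketch shows that idea; the `fuel_type` values, the vocabulary, and the helper name are assumptions for illustration, and the lab itself works with TensorFlow rather than this hand-rolled encoding.

```python
import numpy as np


def one_hot_encode(values, vocabulary):
    """Encode each categorical value as a one-hot row over the vocabulary."""
    index = {token: i for i, token in enumerate(vocabulary)}
    encoded = np.zeros((len(values), len(vocabulary)))
    for row, value in enumerate(values):
        encoded[row, index[value]] = 1.0
    return encoded


# Hypothetical categorical column and its vocabulary.
fuel_type = ["gas", "diesel", "gas"]
vocabulary = ["diesel", "gas"]
categorical = one_hot_encode(fuel_type, vocabulary)

# A made-up, already scaled numerical column for the same three cars.
numerical = np.array([[0.2], [0.9], [0.5]])

# Final feature matrix: one numerical column followed by the one-hot columns.
features = np.hstack([numerical, categorical])
print(features)
```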


## Labs that use the California Housing Data for Predicting Real-Valued Labels

The next set of labs uses the California Housing Data. We start by splitting the training data into a train and validation set, and visually demonstrating what can happen if you don’t randomize the data before creating the train/validation split. Next, we introduce synthetic features, since they are important for this data set and nicely illustrate the kind of feature engineering that can be done to train good linear models on real data sets.

###[Lab 6: Creating Validation Data](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_6__Creating_Validation_Data.ipynb)

**Learning Objectives:**
  * Generate a train and validation data set for housing data that we will use to predict median housing price, at the granularity of city blocks.
  * Debug issues in the creation of the train and validation splits.
  * Select the best single feature to use to train a linear model to predict the median housing price.
  * Test that the prediction loss on the validation data accurately reflects the trained model's loss on unseen test data.
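
The sketch below illustrates the kind of problem an unshuffled split can cause. It assumes, purely for illustration, that the rows arrive sorted by one column (here a made-up `longitude` array); the function name and parameters are hypothetical.

```python
import numpy as np


def train_validation_split(data, validation_fraction, shuffle=True, seed=0):
    """Return (train, validation) arrays, optionally shuffling rows first."""
    indices = np.arange(len(data))
    if shuffle:
        indices = np.random.RandomState(seed).permutation(len(data))
    n_validation = int(len(data) * validation_fraction)
    validation = data[indices[:n_validation]]
    train = data[indices[n_validation:]]
    return train, validation


# Assumed: 100 rows sorted by longitude, as a raw file might be.
longitude = np.linspace(-124.0, -114.0, 100)

sorted_train, sorted_validation = train_validation_split(
    longitude, 0.2, shuffle=False)
shuffled_train, shuffled_validation = train_validation_split(
    longitude, 0.2, shuffle=True)

print("unshuffled validation range:",
      sorted_validation.min(), sorted_validation.max())
print("shuffled validation range:",
      shuffled_validation.min(), shuffled_validation.max())
```

Without shuffling, the validation set covers only one end of the sorted column, so its loss says little about how the model behaves on the rest of the data.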

###[Lab 7: Feature Engineering - Creating Synthetic Features](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_7__Feature_Engineering_-_Creating_Synthetic_Features.ipynb)

**Learning Objectives:**

* Gain more experience with the LinearRegressor class in TensorFlow by using it to predict median housing price, at the granularity of city blocks
* Use a validation data set and test set to make sure that our model will generalize and is not overfitting the training data.
* Use test data only after tuning hyperparameters as a measure of how the model will generalize to new data
* Create synthetic features from the existing features (e.g., taking a ratio of two other features)
* Get more practice with feature transformations, including identifying and clipping (removing) outliers in the input data to obtain the best model
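
As a small illustration of the last two objectives, the sketch below builds a ratio feature from two columns and then clips it to a chosen range. The column names, the values, and the clipping bounds are made up; note that `np.clip` caps outliers at the bounds rather than dropping the rows.

```python
import numpy as np


def ratio_feature(numerator, denominator):
    """Create a synthetic feature as the ratio of two existing features."""
    return numerator / denominator


def clip_outliers(values, low, high):
    """Cap values below `low` or above `high` at those bounds."""
    return np.clip(values, low, high)


# Hypothetical per-block totals; the last row is an obvious outlier.
total_rooms = np.array([880.0, 7099.0, 1467.0, 30000.0])
population = np.array([322.0, 2401.0, 496.0, 100.0])

rooms_per_person = ratio_feature(total_rooms, population)
clipped = clip_outliers(rooms_per_person, 0.0, 5.0)

print(rooms_per_person)
print(clipped)
```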


## Labs that use the Census Data for a Classification Problem


We start with a linear classifier that uses just the raw numerical and categorical features. Bucketized features were introduced earlier, but here we introduce quantiles as a way to avoid hand-picking the thresholds, and we also combine the bucketized features with the raw features. The next step to further improve a linear model is to introduce feature crosses. After doing this, the students should begin to see some overfitting, so it’s natural to introduce L2 regularization, and then L1 regularization to reduce the model size. Finally, a DNN can be introduced and compared to a linear model. Students should begin to think about the pros and cons of moving from a linear model with crosses to a DNN.

###[Lab 8: Train a Linear Classifier with Numerical and Categorical Features](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_8__Training_a_Linear_Classifier_with_Numerical_and_Categorical_Features.ipynb)

**Learning Objectives:**
* Introduce logistic regression to train a binary classifier.
* Understand metrics such as ROC curves, AUC, log loss, and classification error.
* Train a linear classifier using the raw numerical and categorical features.
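
The metrics listed above can be computed by hand on a tiny example. This NumPy sketch turns raw linear outputs into probabilities with a sigmoid, then computes log loss, classification error at a 0.5 threshold, and AUC as the fraction of correctly ordered positive/negative pairs. The labels and scores are made up, and the helper names are assumptions rather than the lab's code.

```python
import numpy as np


def sigmoid(z):
    """Map raw linear outputs (logits) to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def log_loss(labels, probabilities):
    """Average negative log-likelihood of the true labels."""
    eps = 1e-12  # keeps log() away from 0
    p = np.clip(probabilities, eps, 1.0 - eps)
    return float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = scores[labels == 1]
    negatives = scores[labels == 0]
    diff = positives[:, None] - negatives[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))


# Made-up labels and raw model outputs for four examples.
labels = np.array([1, 0, 1, 0])
scores = np.array([2.0, -1.0, 0.5, 1.0])

probabilities = sigmoid(scores)
predictions = (probabilities >= 0.5).astype(int)
classification_error = float(np.mean(predictions != labels))

print("log loss:", log_loss(labels, probabilities))
print("classification error:", classification_error)
print("AUC:", auc(labels, scores))
```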

###[Lab 9: Bucketized Features using Quantiles and Feature Crosses](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_9__Bucketized_Features_Using_Quantiles_and_Feature_Crosses.ipynb)

**Learning Objectives:**
  * Learn to use quantiles to create bucketized features.
  * Learn how to introduce feature crosses.
  * Starting from just having the data loaded, train a linear classifier to predict if an individual's income is at least 50k using numerical features, categorical features, bucketized features, and feature crosses.
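
Quantile-based boundaries put roughly the same number of examples in each bucket, and a feature cross assigns a distinct id to each combination of two bucketized or categorical features. The sketch below shows both ideas in NumPy; the `age` and `education` arrays and the helper names are hypothetical and do not come from the Census data.

```python
import numpy as np


def quantile_boundaries(values, num_buckets):
    """Return the interior boundaries that split values into equal-sized buckets."""
    percentiles = np.linspace(0, 100, num_buckets + 1)[1:-1]
    return np.percentile(values, percentiles)


def cross(index_a, index_b, size_b):
    """Combine two categorical indices into one id per (a, b) pair."""
    return index_a * size_b + index_b


# Made-up ages and a two-valued education index (0 or 1).
age = np.array([18.0, 25.0, 33.0, 41.0, 52.0, 60.0, 67.0, 75.0])
education = np.array([0, 1, 0, 1, 0, 1, 0, 1])

boundaries = quantile_boundaries(age, num_buckets=4)
age_bucket = np.digitize(age, boundaries)
age_x_education = cross(age_bucket, education, size_b=2)

print("boundaries:", boundaries)
print("age buckets:", age_bucket)
print("crossed ids:", age_x_education)
```

With 4 age buckets and 2 education values, the cross has 8 possible ids, each of which would get its own weight in a linear model.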

###[Lab 10: Regularization to Reduce Overfitting and Model Size](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_10__Regularization_to_Reduce_Overfitting_and_Model_Size.ipynb)

**Learning Objectives:**
* Replace the `StochasticGradientDescent` optimizer with the `FTRLOptimizer`
* Use L2 regularization to help reduce overfitting
* Use L1 regularization to create sparsity and reduce model size
* Look at ROC curves to understand trade-off between model size and accuracy
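
To see why L1 regularization produces sparsity while L2 does not, it can help to look at what each penalty does to a weight vector in a single update. The sketch below is not the FTRL algorithm; it is a simplified illustration that applies a gradient step on the L2 penalty and a soft-thresholding step for the L1 penalty, with made-up weights and strengths.

```python
import numpy as np


def l2_shrink(weights, strength, learning_rate):
    """Gradient step on strength * sum(w^2): scales every weight toward zero."""
    return weights * (1.0 - 2.0 * learning_rate * strength)


def l1_shrink(weights, strength, learning_rate):
    """Soft-thresholding step for strength * sum(|w|): small weights become 0."""
    threshold = learning_rate * strength
    return np.sign(weights) * np.maximum(np.abs(weights) - threshold, 0.0)


# Hypothetical weights: one tiny, one medium, one large.
weights = np.array([0.05, -0.5, 2.0])

after_l2 = l2_shrink(weights, strength=1.0, learning_rate=0.1)
after_l1 = l1_shrink(weights, strength=1.0, learning_rate=0.1)

print("after L2:", after_l2, "nonzero:", np.count_nonzero(after_l2))
print("after L1:", after_l1, "nonzero:", np.count_nonzero(after_l1))
```

The L2 step shrinks all weights proportionally but leaves them nonzero, while the L1 step sets the smallest weight exactly to zero, which is what reduces model size.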

###[Lab 11: Train a DNN](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_11__Train_A_DNN.ipynb)

**Learning Objectives:**
* Train a DNN and compare the performance to that when using a linear model with feature crosses and bucketized features.
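
One way to build intuition for this comparison is the classic XOR example: a linear model on two raw binary inputs cannot represent XOR, but it can once a feature cross is added, and a network with one small hidden layer can represent it without any hand-made cross. The weights below are set by hand for illustration, not learned, and the function names are hypothetical.

```python
import numpy as np


def relu(z):
    """Rectified linear unit activation."""
    return np.maximum(z, 0.0)


def tiny_dnn(x):
    """One hidden layer with two ReLU units and hand-set weights."""
    hidden_weights = np.array([[1.0, 1.0], [1.0, 1.0]])  # shape (inputs, hidden)
    hidden_bias = np.array([0.0, -1.0])
    output_weights = np.array([1.0, -2.0])
    hidden = relu(x @ hidden_weights + hidden_bias)
    return hidden @ output_weights


def linear_with_cross(x):
    """Linear model over x1, x2, and the hand-made cross x1 * x2."""
    return x[:, 0] + x[:, 1] - 2.0 * x[:, 0] * x[:, 1]


inputs = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])

dnn_outputs = tiny_dnn(inputs)
cross_outputs = linear_with_cross(inputs)
print("DNN:", dnn_outputs)
print("linear + cross:", cross_outputs)
```

Both models produce the XOR pattern here; the difference is that the linear model needed the cross to be supplied as a feature, whereas the hidden layer builds an equivalent interaction internally.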

## Lab to Apply Embeddings for Movie Review Sentiment Analysis

The final data set introduces the students to training embeddings in a DNN, with the application area of sentiment analysis.

###[Lab 12: Learning Embeddings](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Lab_12__Learning_Embeddings.ipynb)

**Learning Objectives:**
* Represent a movie review as a bag of words
* Implement a sentiment-analysis linear model
* Implement a sentiment-analysis DNN model using an embedding that projects data into two dimensions
* Visualize the embedding to see what the model has learned about the relationships between words
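
The sketch below shows a bag-of-words vector for one review and how a two-dimensional embedding table turns it into a point in the plane. The three-word vocabulary and the embedding values are invented for illustration; in the lab the embedding would be learned during training rather than written by hand.

```python
import numpy as np


def bag_of_words(review, vocabulary):
    """Count how often each vocabulary word occurs in the review."""
    index = {word: i for i, word in enumerate(vocabulary)}
    counts = np.zeros(len(vocabulary))
    for token in review.lower().split():
        if token in index:  # words outside the vocabulary are ignored
            counts[index[token]] += 1.0
    return counts


def embed(counts, embedding):
    """Sum the embedding rows of the words present, weighted by their counts."""
    return counts @ embedding


# Hypothetical vocabulary and a made-up 2-D embedding (one row per word).
vocabulary = ["great", "boring", "movie"]
embedding = np.array([
    [1.0, 0.5],   # "great"
    [-1.0, 0.2],  # "boring"
    [0.0, 0.0],   # "movie"
])

counts = bag_of_words("Great movie great cast", vocabulary)
point = embed(counts, embedding)

print("bag of words:", counts)
print("2-D representation:", point)
```

Because each word maps to a 2-D row, the rows themselves can be drawn on a scatter plot to inspect which words end up near each other.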

## Extra Credit Lab: Training a DNN to Predict a Real-Valued Label (Regression)

###[Training a Deep Neural Network for a Regression Problem](https://colab.research.google.com/notebook#fileId=/v2/external/notebooks/intro_to_ml_semester_course/Extra_Credit_Lab__Training_a_DNN_for_a_Regression_Problem.ipynb)

**Learning Objectives:**
* Train a DNNRegressor on the housing data, exploring different configurations of hidden units and choices of features to include.
* Another option is to introduce bucketized features and crosses and compare what you can do with a linear model using those as compared to a DNN.
* Another option is to introduce dropout as an alternate to L2 regulariz