Commit: Vertex AI dataset, training works.
Lakshmanan, V committed Nov 4, 2021
1 parent 781d3ae commit 767a223
Showing 19 changed files with 455 additions and 512 deletions.
4 changes: 4 additions & 0 deletions 02_ingest/README.md
@@ -5,6 +5,10 @@

### Populate your bucket with the data you will need for the book
The simplest way to get the files you need is to copy them from my bucket:
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo
* Run the program ./ingest_from_crsbucket.sh and specify your bucket name.
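For example, assuming a bucket named my-bucket (a placeholder; substitute your own bucket name), the two steps above look like this in CloudShell:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh my-bucket
```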

14 changes: 9 additions & 5 deletions 03_sqlstudio/README.md
@@ -3,11 +3,15 @@
### Catch up to Chapter 2
If you have not already done so, load the raw data into a BigQuery dataset:
* Go to the Storage section of the GCP web console and create a new bucket
* In CloudShell, git clone this repository. Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
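To sanity-check the copy, you can list a few of the ingested files (a sketch; BUCKET is a placeholder, and the gs://BUCKET/flights layout is assumed to match what the ingest script writes):
```
BUCKET=my-bucket   # placeholder: your bucket name
gsutil ls gs://$BUCKET/flights/ | head -5
```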


### Optional: Load the data into PostgreSQL
6 changes: 5 additions & 1 deletion 04_streaming/README.md
@@ -2,7 +2,11 @@

### Catch up until Chapter 3 if necessary
* Go to the Storage section of the GCP web console and create a new bucket
* In CloudShell, git clone this repository. Then, run:
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
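Once the ingest completes, a quick row count confirms the BigQuery load (a sketch; assumes the dsongcp dataset used elsewhere in this repo):
```
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS num_rows FROM dsongcp.flights_tzcorr'
```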
59 changes: 31 additions & 28 deletions 05_bqnotebook/README.md
@@ -3,43 +3,46 @@
### Catch up from previous chapters if necessary
If you didn't go through Chapters 2-4, the simplest way to catch up is to copy data from my bucket:
* Go to the Storage section of the GCP web console and create a new bucket
* In CloudShell, git clone this repository. Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
* Run:
```
cd ../03_sqlstudio
./create_views.sh
```
* Run:
```
cd ../04_streaming
./ingest_from_crsbucket.sh
```

## Try out queries
* In BigQuery, query the time-corrected files created in Chapter 4:
```
SELECT
  ORIGIN,
  AVG(DEP_DELAY) AS dep_delay,
  AVG(ARR_DELAY) AS arr_delay,
  COUNT(ARR_DELAY) AS num_flights
FROM
  dsongcp.flights_tzcorr
GROUP BY
  ORIGIN
```
* Try out the other queries in queries.txt in this directory.
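The aggregation above can also be run from the bq command-line tool rather than the BigQuery console (a sketch):
```
bq query --use_legacy_sql=false '
SELECT
  ORIGIN,
  AVG(DEP_DELAY) AS dep_delay,
  AVG(ARR_DELAY) AS arr_delay,
  COUNT(ARR_DELAY) AS num_flights
FROM dsongcp.flights_tzcorr
GROUP BY ORIGIN'
```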

* Navigate to the Vertex AI Workbench part of the GCP console.

* Start a new managed notebook. Then, copy and paste cells from [exploration.ipynb](exploration.ipynb) and click Run to execute the code.

* Create the trainday BigQuery table and CSV file, as you will need them later:

```
./create_trainday.sh
```
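After the script completes, you can peek at the first few rows of the resulting table (a sketch; assumes the table is created as trainday in the dsongcp dataset):
```
bq head -n 5 dsongcp.trainday
```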
4 changes: 4 additions & 0 deletions 06_dataproc/README.md
@@ -4,6 +4,10 @@ To repeat the steps in this chapter, follow these steps.

### Catch up from Chapters 2-5
If you didn't go through Chapters 2-5, the simplest way to catch up is to copy data from my bucket:
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
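If you find yourself catching up repeatedly, the bulleted steps above can be collected into one script (a sketch; BUCKET is a placeholder for your bucket name):
```
#!/bin/bash
BUCKET=my-bucket   # placeholder: your bucket name
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
cd data-science-on-gcp/02_ingest && ./ingest_from_crsbucket.sh $BUCKET
cd ../04_streaming && ./ingest_from_crsbucket.sh $BUCKET
cd ../05_bqnotebook && bash load_into_bq.sh $BUCKET
```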
4 changes: 4 additions & 0 deletions 07_sparkml/README.md
@@ -4,6 +4,10 @@
If you didn't go through Chapters 2-6, the simplest way to catch up is to copy data from my bucket:

#### Catch up from Chapters 2-5
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
4 changes: 4 additions & 0 deletions 08_bqml/README.md
@@ -4,6 +4,10 @@
If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:

#### Catch up from Chapters 2-7
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
65 changes: 33 additions & 32 deletions 09_vertexai/README.md
@@ -1,46 +1,47 @@
# Machine Learning Classifier using TensorFlow

### Catch up from previous chapters if necessary
If you didn't go through Chapters 2-8, the simplest way to catch up is to copy data from my bucket:
If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:

#### Catch up from Chapters 2-7
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Create a dataset named "flights" in BigQuery by typing:
```
bq mk flights
```
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
* In BigQuery, run this query and save the results as a table named trainday (hashing FL_DATE with FARM_FINGERPRINT yields a deterministic, repeatable ~70/30 split of days between training and evaluation; see the quick check after this list):
```
#standardsql
SELECT
  FL_DATE,
  IF(MOD(ABS(FARM_FINGERPRINT(CAST(FL_DATE AS STRING))), 100) < 70, 'True', 'False') AS is_train_day
FROM (
  SELECT
    DISTINCT(FL_DATE) AS FL_DATE
  FROM
    `flights.tzcorr`)
ORDER BY
  FL_DATE
```
* Alternatively, run the script that creates the same table for you:
```
bash create_trainday.sh <BUCKET-NAME>
```
* Go to the 08_dataflow folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
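FARM_FINGERPRINT always returns the same hash for the same input string, which is what makes the trainday split above repeatable. A quick sketch to see this from the bq command line (run it twice; the result does not change):
```
bq query --use_legacy_sql=false \
  'SELECT FARM_FINGERPRINT("2015-01-01") AS h,
          MOD(ABS(FARM_FINGERPRINT("2015-01-01")), 100) < 70 AS is_train_day'
```
Because the assignment depends only on the date, a given day always falls entirely in the training set or entirely in the evaluation set.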

## This Chapter

### Vertex AI Workbench
* Open a new notebook in Vertex AI Workbench from https://console.cloud.google.com/vertex-ai/workbench
* Launch a new terminal window and type in it:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* In the navigation pane on the left, navigate to data-science-on-gcp/09_vertexai
* Open the notebook flights_model_tf2.ipynb and run the cells. Note that the notebook sets
DEVELOP_MODE=True, so it trains on only a very small sample of the data; this is just
to verify that the code works.

### This Chapter
You can do this in one of two ways: from a notebook or from CloudShell.

#### 1. From AI Platform Notebooks
* Start a Cloud AI Platform Notebooks instance with TensorFlow 2.1 or later
* Open a Terminal and type:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Browse to, and open, flights_model_tf2.ipynb
* Run the code to train and deploy the model
* The above code runs on only a small subset of the data. To run on the full dataset, run the cells in flights_caip.ipynb

#### (Optional) From CloudShell
#### From CloudShell
* Install the aiplatform library:
```
pip3 install google-cloud-aiplatform
```
* Try running the standalone model file on a small sample:
```
python3 model.py --bucket <bucket-name> --develop
```
* Run a Vertex AI Pipeline on the full data set:
```
python3 train_on_vertexai.py --project <project> --bucket <bucket-name>
```
* Get the model to predict:
```
./call_predict.py --project=$(gcloud config get-value core/project)
```
8 changes: 0 additions & 8 deletions 09_vertexai/flights/Dockerfile

This file was deleted.

10 changes: 0 additions & 10 deletions 09_vertexai/flights/PKG-INFO

This file was deleted.

10 changes: 0 additions & 10 deletions 09_vertexai/flights/push_docker.sh

This file was deleted.

5 changes: 0 additions & 5 deletions 09_vertexai/flights/setup.cfg

This file was deleted.

31 changes: 0 additions & 31 deletions 09_vertexai/flights/setup.py

This file was deleted.

14 changes: 0 additions & 14 deletions 09_vertexai/flights/trainer/__init__.py

This file was deleted.

