Commit: Vertex AI dataset, training works.
Lakshmanan, V committed Nov 4, 2021
1 parent 781d3ae commit 767a223
Showing 19 changed files with 455 additions and 512 deletions.
4 changes: 4 additions & 0 deletions 02_ingest/README.md
@@ -5,6 +5,10 @@

### Populate your bucket with the data you will need for the book
The simplest way to get the files you need is to copy them from my bucket:
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo
* Run the program ./ingest_from_crsbucket.sh and specify your bucket name.
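For example, assuming a bucket named my-bucket (a placeholder; substitute your own bucket name), the two steps above look like this in CloudShell:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh my-bucket
```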

14 changes: 9 additions & 5 deletions 03_sqlstudio/README.md
@@ -3,11 +3,15 @@
### Catch up to Chapter 2
If you have not already done so, load the raw data into a BigQuery dataset:
* Go to the Storage section of the GCP web console and create a new bucket
* In CloudShell, git clone this repository. Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
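To sanity-check the copy, you can list a few of the ingested files (a sketch; BUCKET is a placeholder, and the gs://BUCKET/flights layout is assumed to match what the ingest script writes):
```
BUCKET=my-bucket   # placeholder: your bucket name
gsutil ls gs://$BUCKET/flights/ | head -5
```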


### Optional: Load the data into PostgreSQL
6 changes: 5 additions & 1 deletion 04_streaming/README.md
@@ -2,7 +2,11 @@

### Catch up until Chapter 3 if necessary
* Go to the Storage section of the GCP web console and create a new bucket
* In CloudShell, git clone this repository. Then, run:
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
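Once the ingest completes, a quick row count confirms the BigQuery load (a sketch; assumes the dsongcp dataset used elsewhere in this repo):
```
bq query --use_legacy_sql=false \
  'SELECT COUNT(*) AS num_rows FROM dsongcp.flights_tzcorr'
```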
59 changes: 31 additions & 28 deletions 05_bqnotebook/README.md
@@ -3,43 +3,46 @@
### Catch up from previous chapters if necessary
If you didn't go through Chapters 2-4, the simplest way to catch up is to copy data from my bucket:
* Go to the Storage section of the GCP web console and create a new bucket
* In CloudShell, git clone this repository. Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Then, run:
```
cd data-science-on-gcp/02_ingest
./ingest_from_crsbucket.sh bucketname
```
* Run:
```
cd ../03_sqlstudio
./create_views.sh
```
* Run:
```
cd ../04_streaming
./ingest_from_crsbucket.sh
```

## Try out queries
* In BigQuery, query the time-corrected files created in Chapter 4:
```
SELECT
  ORIGIN,
  AVG(DEP_DELAY) AS dep_delay,
  AVG(ARR_DELAY) AS arr_delay,
  COUNT(ARR_DELAY) AS num_flights
FROM
  dsongcp.flights_tzcorr
GROUP BY
  ORIGIN
```
* Try out the other queries in queries.txt in this directory.
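The aggregation above can also be run from the bq command-line tool rather than the BigQuery console (a sketch):
```
bq query --use_legacy_sql=false '
SELECT
  ORIGIN,
  AVG(DEP_DELAY) AS dep_delay,
  AVG(ARR_DELAY) AS arr_delay,
  COUNT(ARR_DELAY) AS num_flights
FROM dsongcp.flights_tzcorr
GROUP BY ORIGIN'
```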

* Navigate to the Vertex AI Workbench part of the GCP console.

* Start a new managed notebook. Then, copy and paste cells from [exploration.ipynb](exploration.ipynb) and click Run to execute the code.

* Create the trainday BigQuery table and CSV file, as you will need them later:

```
./create_trainday.sh
```
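After the script completes, you can peek at the first few rows of the resulting table (a sketch; assumes the table is created as trainday in the dsongcp dataset):
```
bq head -n 5 dsongcp.trainday
```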
4 changes: 4 additions & 0 deletions 06_dataproc/README.md
@@ -4,6 +4,10 @@ To repeat the steps in this chapter, follow these steps.

### Catch up from Chapters 2-5
If you didn't go through Chapters 2-5, the simplest way to catch up is to copy data from my bucket:
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
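If you find yourself catching up repeatedly, the bulleted steps above can be collected into one script (a sketch; BUCKET is a placeholder for your bucket name):
```
#!/bin/bash
BUCKET=my-bucket   # placeholder: your bucket name
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
cd data-science-on-gcp/02_ingest && ./ingest_from_crsbucket.sh $BUCKET
cd ../04_streaming && ./ingest_from_crsbucket.sh $BUCKET
cd ../05_bqnotebook && bash load_into_bq.sh $BUCKET
```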
4 changes: 4 additions & 0 deletions 07_sparkml/README.md
@@ -4,6 +4,10 @@
If you didn't go through Chapters 2-6, the simplest way to catch up is to copy data from my bucket:

#### Catch up from Chapters 2-5
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
4 changes: 4 additions & 0 deletions 08_bqml/README.md
@@ -4,6 +4,10 @@
If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:

#### Catch up from Chapters 2-7
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
65 changes: 33 additions & 32 deletions 09_vertexai/README.md
@@ -1,46 +1,47 @@
# Machine Learning Classifier using TensorFlow

### Catch up from previous chapters if necessary
If you didn't go through Chapters 2-8, the simplest way to catch up is to copy data from my bucket:
If you didn't go through Chapters 2-7, the simplest way to catch up is to copy data from my bucket:

#### Catch up from Chapters 2-7
* Open CloudShell and git clone this repo:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Go to the 02_ingest folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Go to the 04_streaming folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
* Create a dataset named "flights" in BigQuery by typing:
```
bq mk flights
```
* Go to the 05_bqnotebook folder of the repo, run the script to load data into BigQuery:
```
bash load_into_bq.sh <BUCKET-NAME>
```
* In BigQuery, run this query and save the results as a table named trainday (hashing FL_DATE with FARM_FINGERPRINT yields a deterministic, repeatable ~70/30 split of days between training and evaluation; see the quick check after this list):
```
#standardsql
SELECT
  FL_DATE,
  IF(MOD(ABS(FARM_FINGERPRINT(CAST(FL_DATE AS STRING))), 100) < 70, 'True', 'False') AS is_train_day
FROM (
  SELECT
    DISTINCT(FL_DATE) AS FL_DATE
  FROM
    `flights.tzcorr`)
ORDER BY
  FL_DATE
```
* Alternatively, run the script that creates the same table for you:
```
bash create_trainday.sh <BUCKET-NAME>
```
* Go to the 08_dataflow folder of the repo, run the program ./ingest_from_crsbucket.sh and specify your bucket name.
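FARM_FINGERPRINT always returns the same hash for the same input string, which is what makes the trainday split above repeatable. A quick sketch to see this from the bq command line (run it twice; the result does not change):
```
bq query --use_legacy_sql=false \
  'SELECT FARM_FINGERPRINT("2015-01-01") AS h,
          MOD(ABS(FARM_FINGERPRINT("2015-01-01")), 100) < 70 AS is_train_day'
```
Because the assignment depends only on the date, a given day always falls entirely in the training set or entirely in the evaluation set.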

## This Chapter

### Vertex AI Workbench
* Open a new notebook in Vertex AI Workbench from https://console.cloud.google.com/vertex-ai/workbench
* Launch a new terminal window and type in it:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* In the navigation pane on the left, navigate to data-science-on-gcp/09_vertexai
* Open the notebook flights_model_tf2.ipynb and run the cells. Note that the notebook sets
DEVELOP_MODE=True, so it trains on only a very small sample of the data; this is just
to verify that the code works.

### This Chapter
You can do this in one of two ways: from a notebook or from CloudShell.

#### 1. From AI Platform Notebooks
* Start a Cloud AI Platform Notebooks instance with TensorFlow 2.1 or later
* Open a Terminal and type:
```
git clone https://github.com/GoogleCloudPlatform/data-science-on-gcp
```
* Browse to, and open, flights_model_tf2.ipynb
* Run the code to train and deploy the model
* The above code runs on only a small subset of the data. To run on the full dataset, run the cells in flights_caip.ipynb

#### (Optional) From CloudShell
#### From CloudShell
* Install the aiplatform library:
```
pip3 install google-cloud-aiplatform
```
* Try running the standalone model file on a small sample:
```
python3 model.py --bucket <bucket-name> --develop
```
* Run a Vertex AI Pipeline on the full data set:
```
python3 train_on_vertexai.py --project <project> --bucket <bucket-name>
```
* Get the model to predict:
```
./call_predict.py --project=$(gcloud config get-value core/project)
```
8 changes: 0 additions & 8 deletions 09_vertexai/flights/Dockerfile

This file was deleted.

10 changes: 0 additions & 10 deletions 09_vertexai/flights/PKG-INFO

This file was deleted.

10 changes: 0 additions & 10 deletions 09_vertexai/flights/push_docker.sh

This file was deleted.

5 changes: 0 additions & 5 deletions 09_vertexai/flights/setup.cfg

This file was deleted.

31 changes: 0 additions & 31 deletions 09_vertexai/flights/setup.py

This file was deleted.

14 changes: 0 additions & 14 deletions 09_vertexai/flights/trainer/__init__.py

This file was deleted.

