# CPUs on demand (Compute Engine)

### Words : 
- No-ops = No operations
- fault-tolerant : a system that withstands failures. If one component breaks, the system keeps working (for example, a system that uses load balancing).
- Preemptible VM or Preemptible Instance : spare machines in the cloud that are sold at a low price. They can be shut down at any time, but if one is shut down before ten minutes have passed, Google does not charge for it. Before shutdown, a signal gives you 30 seconds to prepare. No such machine can run for more than 24 hours; it is always shut down by then.

CPUs (central processing units) on the cloud are provided by a compute engine of virtual machines.

Like any computer, and the cloud computer is a computer, you need **central processing units**, CPUs. And the CPUs on the cloud are provided by a **compute engine** of virtual machines. 

And you need a place to **store your input data**, to **store your output data**, to **store your intermediate data**, things that are persistent, things that are temporary. And that *storage on GCP* is provided by **cloud storage**. 

And connecting the two, *connecting the virtual machines or compute engine instances with these storage units or cloud storage*, is a **private network**. You tend not to interact directly with this network, but it's there, and it is what allows you to have a global-scale data and compute infrastructure. 

![](img/14.png)

For the most part, when you work with GCP, you will not be working at the level at which we're going to be talking about in this chapter. *So we're not going to be working at the level of individual virtual machines. No, you're not going to be spinning up VMs in order to do a job, we'll be working with things that are much higher level than that.* 

But even if you need to work at this low level, in terms of infrastructure, the design goals of GCP remain the same. And the design goal is for working with cloud infrastructure to be as no-ops as possible. No-ops here essentially means that we want to minimize system administration overhead. And because we're talking about computing and storage, we also want this to be as flexible as possible, in such a way that you can change the type of virtual machine that you're running without paying any penalties, for example. So you're not reserving instances for long periods of time. In fact, we want to make it as flexible and easy as possible for you to get your compute jobs done. 

So when we talk about compute engine, the idea is in terms of flexibility. 

You can go ahead and get a **compute engine** that is, say, an `n1-standard-4`, and that's a very specific configuration of machine that you can have. *But we would like you to stop thinking about it in terms of this very specific physical infrastructure*, and instead **start thinking in terms of more abstract concepts. So for example you might say, I want a virtual machine that has 8 CPUs and 30 gigs of RAM.** 

![](img/15.png)

And it's the job of the Google cloud infrastructure to go ahead and fetch you a virtual machine that has 8 CPUs and 30 gigs of RAM. 
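One way to express a request like this is a custom machine type. As a rough sketch, the helper below turns "8 CPUs and 30 GB of RAM" into a machine type string. It assumes the `custom-<vCPUs>-<memory in MB>` naming used for N1 custom machine types and a rule that memory must be a multiple of 256 MB; neither detail comes from these notes, and the function name is hypothetical.

```python
def custom_machine_type(vcpus: int, memory_gb: float) -> str:
    """Build a custom machine type name such as 'custom-8-30720'.

    Assumption: names follow 'custom-<vCPUs>-<memory in MB>' and memory
    must be a multiple of 256 MB.
    """
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    memory_mb = int(memory_gb * 1024)
    if memory_mb % 256 != 0:
        raise ValueError("memory must be a multiple of 256 MB")
    return f"custom-{vcpus}-{memory_mb}"


print(custom_machine_type(8, 30))  # custom-8-30720
```

The point of the sketch is only that you describe the shape of the machine you want, rather than picking from a fixed catalogue.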

Regardless of the type of machine that you get, you will always get load balancing, advanced networking, monitoring, clustering, container support, etc. So there is no second class machine here. Every machine that you have, has all of these capabilities built in. At the same time we want to give you flexible compute. And you lose flexibility whenever you say that I have to go ahead and get a machine, and have to keep it running for months on end. 

Because face it, *if you're running a machine for months on end, you have essentially bought the machine.* And what we want is for you to **work with machines on the order of minutes**. However, there are always going to be workloads where you might find yourself having a machine and using it at full tilt for long periods of time. Rather than ask you to try to determine in advance which of your workloads you are going to be running for long periods of time, **GCP gives you a discount after the fact**. 

So at the end of the month, if it turns out that you've used a machine for 60% of the month, you will automatically get a 15% discount. And this is something that happens on your bill after we've found that you've used it. So what this means is that you always get to retain your agility. 
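The 15% figure can be reproduced with a small calculation. The sketch below assumes the tiered schedule historically published for N1 machine types, in which each successive quarter of the month is billed at 100%, 80%, 60% and 40% of the base rate. That schedule is an assumption here, not something these notes state, and current pricing may differ.

```python
# Assumed tiers: (upper bound of usage as a fraction of the month,
#                 fraction of the base rate charged within that tier)
TIERS = [(0.25, 1.0), (0.50, 0.8), (0.75, 0.6), (1.00, 0.4)]


def effective_discount(usage_fraction: float) -> float:
    """Overall discount for running a VM for usage_fraction of the month."""
    if not 0 < usage_fraction <= 1:
        raise ValueError("usage_fraction must be in (0, 1]")
    billed = 0.0
    lower = 0.0
    for upper, rate in TIERS:
        # Portion of the usage that falls inside this tier (never negative).
        used_in_tier = max(0.0, min(usage_fraction, upper) - lower)
        billed += used_in_tier * rate
        lower = upper
    return 1 - billed / usage_fraction


print(f"{effective_discount(0.60):.0%}")  # 15%
print(f"{effective_discount(1.00):.0%}")  # 30%
```

Under these assumed tiers, 60% usage is billed as 25 + 20 + 6 = 51 units instead of 60, which is the 15% discount mentioned above.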

**So, for example, if you have a workload that's currently running on 8 CPUs and you decide that you need to increase it to 12 CPUs, for a few hours, well, go ahead and do that, right? You can move your workload to a different machine when you need to and move it back to a smaller machine when the peak loads go away.**

In addition to this whole idea of being able to change the machine type of stopped instances, you have another concept that's very useful, especially when it comes to jobs like Hadoop jobs. And this is the idea of a **preemptible virtual machine**. 

One of the reasons that GCP can say, well, if you want a machine with 8 CPUs and 30 gigs of RAM, we'll find it for you and give it to you, is that **some of the machines that are currently being used are what are called preemptible**. Whoever is using those machines has agreed that, in return for a hefty 80% discount on the machine charge, they will give them up if someone comes along who is willing to pay full price for those machines. **So that's what a preemptible machine is**. 

> So a preemptible machine is a machine that you get a great discount on in return for your flexibility in letting go of it when it is needed elsewhere. 

![](img/16.png)

**But why would you do that?**

Why would you have a machine that you're willing to give up? 

Well, suppose you're running a workload like Hadoop, which is *fault-tolerant*: if a machine goes away, whatever that machine was doing gets redistributed among the other workers. *Then preemptible machines are a great strategy to reduce your overall cost.* 
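The redistribution idea can be illustrated with a toy model. This is a minimal sketch, not how Hadoop actually schedules work: the worker names, the round-robin assignment and the "least loaded worker" rule are all assumptions made for illustration.

```python
from collections import defaultdict


def assign(tasks, workers):
    """Spread tasks over workers round-robin; returns {worker: [tasks]}."""
    assignment = defaultdict(list)
    for i, task in enumerate(tasks):
        assignment[workers[i % len(workers)]].append(task)
    return dict(assignment)


def preempt(assignment, lost_worker):
    """Remove lost_worker and hand its tasks to the remaining workers."""
    remaining = {w: list(t) for w, t in assignment.items() if w != lost_worker}
    if not remaining:
        raise RuntimeError("no workers left to take over the tasks")
    for task in assignment.get(lost_worker, []):
        # Give each orphaned task to whichever worker currently has the least to do.
        least_loaded = min(remaining, key=lambda w: len(remaining[w]))
        remaining[least_loaded].append(task)
    return remaining


before = assign(range(6), ["standard-1", "standard-2", "preemptible-1"])
after = preempt(before, "preemptible-1")
print(after)
```

After the preemptible worker disappears, every task is still assigned to someone, which is what makes losing the machine tolerable.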

For example, say that you're creating a **Dataproc cluster**. **Dataproc** is a **Hadoop cluster** on GCP; we'll look at it in the next chapter. So you may say, I'm going to create a Dataproc cluster, and in my Dataproc cluster I'm going to have **10 standard VMs and 30 preemptible VMs**. With 40 machines instead of 10, your job is going to get done roughly four times faster. And at the same time, **those extra 30 machines that you're using come at an 80% discount, that is, at 20% of the normal cost**. So not only are you getting it done faster, **you're also getting it done cheaper**. So preemptible machines are a good thing to incorporate into your strategy, with the idea that even if you don't get any preemptible machines, the standard machines are enough for you to get the job done in a timely manner. You don't want to bank on a preemptible machine being available when you need it, but if it is available and you happen to get it, you've automatically gotten a huge discount on the total cost of your job. 
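The "faster and cheaper" claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes perfectly linear scaling, that no preemptible VM is taken away during the job, a unit price of 1.0 per machine-hour, and a hypothetical 8-hour baseline; only the 10/30 split and the 80% discount come from the text.

```python
def job_hours_and_cost(standard, preemptible, base_hours, base_workers,
                       price=1.0, preemptible_discount=0.80):
    """Return (hours, cost) for a job that takes base_hours on base_workers machines.

    Assumes perfectly linear scaling and that no preemptible VM is preempted.
    """
    workers = standard + preemptible
    hours = base_hours * base_workers / workers
    preemptible_price = price * (1 - preemptible_discount)
    cost = hours * (standard * price + preemptible * preemptible_price)
    return hours, cost


baseline_hours, baseline_cost = job_hours_and_cost(10, 0, base_hours=8, base_workers=10)
mixed_hours, mixed_cost = job_hours_and_cost(10, 30, base_hours=8, base_workers=10)

print("speed-up:", baseline_hours / mixed_hours)
print("cost relative to baseline:", round(mixed_cost / baseline_cost, 2))
```

Under these assumptions the mixed cluster finishes in a quarter of the time for about 40% of the baseline cost, because the 30 extra machines together cost about as much as 6 standard ones and are only needed for a quarter as long.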

# Lab : Start a Compute Engine instance

![](img/17.png)

### Start Compute Engine instance

#### Overview
In this lab you spin up a virtual machine, configure its security, and access it remotely.

#### What you need
To complete this lab, you need:
- A project created on Google Cloud Platform

#### What you learn
In this lab, you:
- Create a Compute Engine instance with the necessary Access and Security
- SSH into the instance
- Install the software package Git (for source code version control)

Start the Codelab
https://codelabs.developers.google.com/codelabs/cpb100-compute-engine/

# A global filesystem (Cloud Storage)

Let's continue our journey into the low-level infrastructure of GCP.
 
**Why are we talking about cloud storage?**

Well, one of the reasons that we are talking about cloud storage is that this is the way that you stage any input into 
- a **relational database, which is Cloud SQL**,
- **BigQuery, which is a data warehouse**,
- or **Dataproc, which is a Hadoop cluster**. 

So **if you want to stage the data into GCP, you first have to get the data into cloud storage**. So cloud storage is blob storage. So you are basically storing raw data in any format, directly onto cloud storage. 

In order to get there though, normally what you might want to do is to get this data from somewhere else. It may be in your data center, it could be on instruments out in the field, it could be logs that are being created. Typically you're ingesting or extracting this data and doing some processing, and that processing could happen on a compute engine. So you could have a compute engine VM where you're doing a whole bunch of processing. And **if you need to do processing, you need very fast seeks, reads and writes of the data. And a good way to do that is to store that data on disk**. The problem, and let's ignore persistent disks for now, is that **typical disks are associated with the compute engine** that they're attached to. When the compute engine goes away, the disk also goes away. 

![](img/18.png)

So in **order to keep data persistent**, the standard practice is not to store this data on a disk, not even a persistent disk, because those tend to be expensive, but **instead to store it on cloud storage**. Cloud storage is persistent storage, it's durable, it's replicated. It can be made globally available if you need it to be. And you can use it to stage the data onto other GCP products. So often the very first step of the data life cycle is to get your data onto cloud storage. 


### So how do you get your data onto cloud storage? 

![](img/19.png)

The simplest way is to **use a command line tool called `gsutil`**. It comes with the gcloud SDK (the Google Cloud SDK), so once you have the SDK installed, you will have the `gsutil` command line. So on whichever machine you're going to be uploading the data from, install the SDK, get `gsutil`, and then say gsutil copy, that's `cp`: `gsutil cp sales*.csv`. Those are the files that I'm copying. Where am I copying these files to? 

Well, in this case I'm copying them to Google Storage, that's `gs://`, and I'm giving them a folder structure, `/data`, so I'm putting all of the sales files in `/data`. But `/data` where? That's where the concept of a bucket comes in. **Loosely, you can think of a bucket as like a domain name**. On the Internet, different machines have different addresses, and each of them has its own file system, which holds your web pages. Well, buckets are very similar to that. `acme-sales`, in this case, plays the part of a domain name. It's the thing that makes the location unique; that's your bucket. You can create any number of buckets that you want, and the bucket name that you provide has to be globally unique. 

> Most commonly, what people do is to make their bucket name have some relationship with their corporate domain name. 

This works similarly to the way that you prove that you own a domain: prove that you own the domain, and you get a bucket name that matches the name of the domain. In the case of a classroom like this, we want a unique bucket name, but we're not going to go through the bother of proving our identity, etc. So we'll just try to come up with a bucket name that happens to be unique. 
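Putting the pieces together, the upload described above could be scripted roughly as follows. This is a minimal sketch that shells out to `gsutil`; the helper names are hypothetical, and `upload` only works on a machine where `gsutil` is installed and authenticated.

```python
import glob
import subprocess


def build_copy_command(files, destination):
    """Return the gsutil command that copies the given files to destination."""
    if not files:
        raise ValueError("no files to copy")
    if not destination.startswith("gs://"):
        raise ValueError("destination must be a gs:// URL")
    return ["gsutil", "cp", *files, destination]


def upload(pattern, destination):
    """Expand a local wildcard and run the copy (requires gsutil on the PATH)."""
    files = sorted(glob.glob(pattern))
    subprocess.run(build_copy_command(files, destination), check=True)


# Building the command needs no cloud access, so it is safe to inspect:
print(build_copy_command(["sales1.csv", "sales2.csv"], "gs://acme-sales/data/"))

# Example (not run here): upload("sales*.csv", "gs://acme-sales/data/")
```

Keeping command construction separate from execution makes it easy to check what would be run before touching a real bucket.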
 
So you can basically copy files with `gsutil cp`. Besides that, you can also do remove, `rm`. `mv` is move, which is essentially copy it and then delete it locally. You can do `ls` to list, so you can do a listing of things on the cloud. The thing to realize is that this URL here, `gs://acme-sales/data/`, even though I explained it as being like a folder structure, that's purely convenience. 
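To make the "purely convenience" point concrete, here is a minimal sketch in which object names are plain strings and a folder-style listing is computed from them on demand. The object names and helper functions are hypothetical; this emulates the idea and is not the actual Cloud Storage implementation.

```python
def split_gs_url(url):
    """Split 'gs://bucket/object/name' into (bucket, object_name)."""
    if not url.startswith("gs://"):
        raise ValueError("expected a gs:// URL")
    bucket, _, object_name = url[len("gs://"):].partition("/")
    return bucket, object_name


def list_level(names, prefix="", delimiter="/"):
    """Emulate a directory listing over flat object names."""
    entries = set()
    for name in names:
        if not name.startswith(prefix):
            continue
        rest = name[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        # Anything after the first delimiter is collapsed into a pseudo-folder.
        entries.add(prefix + head + sep)
    return sorted(entries)


# Hypothetical contents of one bucket: plain strings, no real directories.
names = ["data/sales1.csv", "data/sales2.csv", "data/archive/sales0.csv", "readme.txt"]

print(split_gs_url("gs://acme-sales/data/sales1.csv"))
print(list_level(names))
print(list_level(names, prefix="data/"))
```

The "folders" `data/` and `data/archive/` appear in the listings even though nothing but four strings was ever stored.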

All that it really is, is a name mapped to a blob. And this name is just a string. However, people tend to think of file systems as being hierarchical, and so you might tend to think of GCS as also being hierarchical. And that's kind of where `ls` comes in. You can basically go ahead and look at the hierarchy. But remember that this hierarchy is additional semantics that you're placing on top of a pure key-value store. You can also do 
- `mb` to make a bucket, 
- `rb` to remove a bucket, 
- `rsync`, which is named after a Unix utility, 

so this is an emulation of that Unix utility. You can basically keep a mirror of something that's local up on the cloud, and then whenever you run `rsync` again, it's just going to look at the files that have changed and upload only those files. `acl` stands for access control list; that's the way you change permissions. 
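The "upload only what changed" behaviour can be sketched as a comparison between local content and what is already in the bucket. This toy version compares SHA-256 content hashes held in dictionaries; that choice is an assumption for illustration, and the real tool may decide differently (for example by file size and modification time).

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to decide whether a file has changed."""
    return hashlib.sha256(data).hexdigest()


def files_to_upload(local, remote_digests):
    """Return names whose content is new or differs from the remote copy."""
    return sorted(
        name for name, data in local.items()
        if remote_digests.get(name) != digest(data)
    )


# Hypothetical state: sales1.csv is already mirrored, sales2.csv has changed locally.
local = {"sales1.csv": b"a,b\n1,2\n", "sales2.csv": b"a,b\n3,4\n"}
remote = {"sales1.csv": digest(b"a,b\n1,2\n"), "sales2.csv": digest(b"old contents")}

print(files_to_upload(local, remote))  # ['sales2.csv']
```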

Even though I'm talking about using `gsutil` as a command line tool, you don't have to use it from the command line. Because all that the command line tool actually does is invoke a web service, a **REST API**. So you can make that exact same REST API call yourself; that's one way to do it. 

The other way is that you could go to the web console, the same way that we created the compute engine. Instead of going to the compute engine part of the user interface, **we can go to the storage part of the user interface and use that**. But because it's a REST API, **you can also use Python or you can use Java or you can use your favorite language, any language that can talk HTTP, which is pretty much any language**. You can basically interact with cloud storage using the REST API. 

Things that you can do in the command line you can do with the REST API, and because you can do them with the REST API, **you can also do them from any language** that you want. And anything that you do from a graphical user interface also uses the same REST API. So in the previous lab, we created a compute engine, and we created it by using the web user interface. But we could also have done it using the command line, and the way you do that is that you would have said `gcloud compute instances create`, and that would have allowed you to create a Compute Engine instance. 
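As a rough illustration of what "the same REST API" means, the sketch below only builds the URLs that a client would call. It assumes the Cloud Storage JSON API layout under `https://storage.googleapis.com/storage/v1/`; the helper names are hypothetical, and a real request would also need an OAuth 2.0 bearer token in the `Authorization` header.

```python
from urllib.parse import quote, urlencode

API_ROOT = "https://storage.googleapis.com"


def object_metadata_url(bucket, object_name):
    """URL for a GET request that returns an object's metadata."""
    # The object name is one path segment, so its slashes must be percent-encoded.
    return (f"{API_ROOT}/storage/v1/b/{quote(bucket, safe='')}"
            f"/o/{quote(object_name, safe='')}")


def object_download_url(bucket, object_name):
    """URL for a GET request that returns the object's contents."""
    return object_metadata_url(bucket, object_name) + "?alt=media"


def simple_upload_url(bucket, object_name):
    """URL for a POST request whose body is the object's contents."""
    query = urlencode({"uploadType": "media", "name": object_name})
    return f"{API_ROOT}/upload/storage/v1/b/{quote(bucket, safe='')}/o?{query}"


print(object_metadata_url("acme-sales", "data/sales1.csv"))
print(simple_upload_url("acme-sales", "data/sales1.csv"))
```

Any language with an HTTP client can issue these requests, which is why the choice of language does not matter.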

![](img/20.png)

So talking more about cloud storage, in addition to using `gsutil` as a one-shot, take this data, copy it to the cloud, **you can also set up a transfer service. The transfer service could be one time or could be recurring.** So you could say, I'm going to take all the data from here and transfer it over, and as new data shows up I'm going to keep transferring it. And **the source of your transfer could be your local machine, it could be a local data center, something that's on premises, it could also be AWS with S3 buckets**; you can transfer them, and you can keep this transfer service going. 

Now as we mentioned, **the whole idea of cloud storage is that you use it as a staging area**. So **you can input it into Cloud SQL, into BigQuery, into Dataproc**, into a variety of different analysis tools and databases. 

You could also use that to take the data from cloud storage and move it to a local SSD disk on a compute engine VM, so that if you're going to be reading some data routinely all the time, you might want to move it from cloud storage into local. 

But lots of times you will have objects that are related, and they will be in the same, quote unquote, folder structure or in the same bucket. So you can control access at the bucket level. But every bucket belongs to a project, and a project is essentially the way you do billing, etc., in GCP. So when you create a bucket in a project, you are basically saying which billing account is going to be responsible for paying for the storage. 

So you can also control access at the project level. You can say that people who are editors on this project can also add and remove files from this particular bucket. But crucially, one of the access control things that you can do is grant access to all authenticated users. In other words, you can provide access to anybody who's logged in with a Google account. 

The other way that you could do it is to provide access to all users. By providing access to all users, what you're essentially saying is that they don't even need to be logged in; they can just come and get the data that you have. And they can get that data using just an HTTP URL. Why would you want to do that? Well, remember what I said earlier: once you put something on cloud storage, that thing is durable, it's persistent, it's also edge-cached, and it has multiple copies made of it. So in other words, this is an easy way to get a content delivery network going for your data. Just take your data, your static data, put it on GCS, and give people the URL to that GCS location. And Google Cloud takes care of the edge caching, replication, reliability, durability, and all of those kinds of things. 
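For an object that has been shared with all users, the HTTP URL can be derived from the bucket and object name. The sketch below assumes the `https://storage.googleapis.com/<bucket>/<object>` form; the helper name and the example object names are hypothetical.

```python
from urllib.parse import quote


def public_url(bucket, object_name):
    """HTTP URL for a publicly readable object (slashes in the name are kept)."""
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


print(public_url("acme-sales", "data/sales1.csv"))
# https://storage.googleapis.com/acme-sales/data/sales1.csv
```

A link like this can be handed to anyone, with no Google account needed, as long as the object's access control allows all users to read it.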

![](img/21.png)

So, if you're going to be putting data on the cloud, and actually, even if you are creating instances, Compute Engine instances to process data on the cloud, you should have a good idea about what zone and what region you want to be doing this in. So what is a zone? What is a region? Well, they're geographic constructs. You can think of a zone as like a data center. So you want to choose the zone or region closest to where your users are. And the reason you do that is that you want to reduce latency. So if almost all of your users are in the central US, you want to choose the data center in Iowa, and you would use us-central as the region, just to reduce latency. 

But the problem with having everything in one zone is, what happens if that zone goes down? If a tornado hits Iowa and the data center in Iowa cannot be accessed, your application cannot be accessed either. So in order to limit service disruptions, you might want to use multiple zones within the same region. So for example, you might run your application in both zone b and zone f, with zone f acting as a backup to zone b. So distributing your apps and data across zones is a way to reduce service disruptions. 

This is good if all of your users are in Europe, or if all of your users are in the US. So you would use us-central if they're all in the US, you will use europe-west if they're all in the EU. 

#### But what if you have a global application? 

Say you have an application with some users in Japan, some users in Europe, and some users in the US. If that's the case, then you need to distribute your apps and data not just within a region but across regions. And the reason you do that is to make your application globally available. So bottom line, control your latency by choosing the closest zone or region. 

Use multiple zones in a region to minimize the impact of service disruptions. 

And use multiple regions to provide global access to your application. 
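These three rules can be sketched in a few lines. The latency numbers and the zone list below are hypothetical values for illustration, and the helper names are made up; the only convention relied on is that a zone name is its region name plus a letter suffix, as in `us-central1-b`.

```python
def region_of(zone):
    """'us-central1-b' -> 'us-central1' (a zone name is its region plus a suffix)."""
    region, _, suffix = zone.rpartition("-")
    if not region or not suffix:
        raise ValueError(f"not a zone name: {zone!r}")
    return region


def closest_region(latency_ms):
    """Pick the region with the lowest measured latency to your users."""
    return min(latency_ms, key=latency_ms.get)


def zones_in(region, zones):
    """All known zones in a region, so work can be spread across more than one."""
    return sorted(z for z in zones if region_of(z) == region)


# Hypothetical measurements and zone list, for illustration only.
latency_ms = {"us-central1": 35, "europe-west1": 120, "asia-northeast1": 160}
zones = ["us-central1-b", "us-central1-f", "europe-west1-b", "asia-northeast1-a"]

region = closest_region(latency_ms)
print(region, zones_in(region, zones))  # us-central1 ['us-central1-b', 'us-central1-f']
```

A global application would repeat the zone selection for each region where it has users, instead of picking a single closest region.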

# Lab : Interact with Cloud Storage

![](img/22.png)

### Overview
In this lab you carry out the steps of an ingest-transform-and-publish data pipeline manually.

### What you need
To complete this lab, you need:
- A project created on Google Cloud Platform [Lab 1]
- A Compute Engine instance created with access to Cloud APIs and with git installed [Lab 2a]

### What you learn
In this lab, you:
- Ingest data into a Compute Engine instance
- Transform data on the Compute Engine instance
- Store the transformed data on Cloud Storage
- Publish Cloud Storage data to the web

Start the Codelab
https://codelabs.developers.google.com/codelabs/cpb100-cloud-storage/

You will use real-time earthquake data published by the United States Geological Survey (USGS).

# Resources

![](img/23.png)

#### Resources
- Compute Engine: https://cloud.google.com/compute/
- Storage: https://cloud.google.com/storage/
- Pricing: https://cloud.google.com/pricing/
- Cloud Launcher: https://cloud.google.com/launcher/
- Pricing Philosophy: https://cloud.google.com/pricing/philosophy/
