### Learning Objectives
- Employ BigQuery and Cloud Datalab to carry out interactive data analysis
- Train and use a neural network using TensorFlow

# Fast Random Access

Let's look at some transformative changes. The first is fast random access.


Many programming languages are object oriented, which means much of the software that you write deals with objects, and the good thing about objects is that they're hierarchical. Let's say, for example, that in your program you have a hierarchical data structure like a Player.

![](img/42.png)

So you have players who are either footballers or cricketers. All cricketers bat, so all cricketers have batting averages, but not all of them bowl, so only some cricketers have bowling averages. So you have this hierarchy, and you want to represent it in a relational database because you want to persist the data. If you use a relational database to persist an object hierarchy, you run into the object-relational impedance mismatch. However you model the hierarchy, if you go look at the table itself there'll be a Name, a Club, a BattingAverage, and a BowlingAverage. What is the BattingAverage of a football player? It makes absolutely no sense, but we have to have it. So we go ahead and put a null in there, and once you do that you have data integrity problems from that point onwards.
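To make the mismatch concrete, here is a minimal Python sketch of the Player hierarchy being flattened into one wide table. The class names follow the example above; the player names, clubs, and averages are made up for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass
class Player:
    name: str
    club: str

@dataclass
class Footballer(Player):
    pass

@dataclass
class Cricketer(Player):
    batting_average: float  # every cricketer bats

@dataclass
class Bowler(Cricketer):
    bowling_average: float  # only some cricketers bowl

# The single relational table has to carry every column for every player.
COLUMNS = ("name", "club", "batting_average", "bowling_average")

def to_row(player):
    """Flatten a player into the wide table; missing fields become None (NULL)."""
    fields = asdict(player)
    return tuple(fields.get(column) for column in COLUMNS)

rows = [
    to_row(Footballer("A. Striker", "City FC")),
    to_row(Cricketer("B. Opener", "County CC", 41.2)),
    to_row(Bowler("C. Seamer", "County CC", 12.5, 24.8)),
]
```

The footballer's row ends up with two nulls that mean "not applicable", which is exactly the kind of value the table schema cannot distinguish from "unknown".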

![](img/43.png)

So how do you prevent this kind of issue with an object-to-relational mapping? Well, one way is to store objects directly, and that's what Cloud Datastore on GCP lets you do. Where a relational database typically goes up to a few gigabytes, Datastore scales up to terabytes of data. What you're storing in Datastore is conceptually like a HashMap: a key or an id that maps to an object, and you store the entire object. So when you write to Datastore you write an entire object, but when you read from it, that is when you search, you can search by the key and you can also search by a property. So you can look for all cricket players whose batting average is greater than 30 runs a game, right?

And you can do this by making that property one of the indexed fields, which we'll look at shortly.
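As a rough mental model only, the "persistent HashMap that you can also search by property" idea can be sketched in plain Python. `TinyEntityStore` is a hypothetical in-memory class, not the Datastore API, and it scans every entity where the real service would use an index.

```python
class TinyEntityStore:
    """In-memory stand-in for a key-to-object store (not the Datastore API)."""

    def __init__(self):
        self._entities = {}  # key -> entity (a dict of properties)

    def put(self, key, entity):
        # Writes always store the whole object under its key.
        self._entities[key] = dict(entity)

    def get(self, key):
        # Lookup by key, like a hash map.
        return self._entities.get(key)

    def query(self, prop, predicate):
        # Search by property. Entities that lack the property are skipped,
        # so a footballer never needs a null batting average.
        return [key for key, entity in self._entities.items()
                if prop in entity and predicate(entity[prop])]

store = TinyEntityStore()
store.put("p1", {"kind": "Cricketer", "name": "B. Opener", "batting_average": 41.2})
store.put("p2", {"kind": "Cricketer", "name": "C. Tailender", "batting_average": 12.0})
store.put("p3", {"kind": "Footballer", "name": "A. Striker"})

# All players whose batting average is greater than 30.
good_batters = store.query("batting_average", lambda avg: avg > 30)
```

Because each object carries only the properties it actually has, the footballer is stored without any batting or bowling fields at all.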

When you want to update, again, you can update just the batting average of a player, and you can do this in a transactional way. So Datastore supports transactions, and it allows you to read and write structured data. It is a pretty good replacement for use cases in your application where you may be using a relational database to persist your data. However, this replacement is something that you have to do explicitly, unlike the things that we talked about in the previous chapter. There we said that if you have a Spark program that you are running on an on-premises Hadoop cluster and you want to run it on GCP, you just run it on Dataproc, and pretty much all of your code migrates unchanged. If you have a MySQL code base, whatever you're doing with MySQL on-premises you can do on Google Cloud using Cloud SQL. Those are easy migrations, right? Take the use cases that you have and just move them to the cloud. But when we talk about something like Datastore, it's not that easy a migration.


![](img/44.png)

You have to change your code, because the way you interact with Datastore is different from the way you'd interact with a relational database. So how do you interact with Datastore? Well, you work with it like a persistent HashMap. For example, let's say we want to persist Author objects. You'd say I have an Author class and it's an entity, so you annotate it with @Entity. I'm showing you Java here, but it works with a variety of object-oriented languages. You say that the author is distinguished by their email address, so the email address is the id field and you annotate it with @Id. We want to search for authors by name, so we'd like the name property to be indexed. And just to show you that you can have has-a relationships, an author has a bunch of different interests. It's the same thing with guestbook entries: each entry has an id that makes it unique, and it has a parent key that refers to an Author.

![](img/45.png)

That is the author who wrote the GuestbookEntry, and that's a relationship. You have the message, which apparently we're never going to search on, because it's not indexed: we're not going to search for guestbook entries based on the text of the message. And we have the date, right? That's something that we might want to search on. So to recap, Author is an entity, and an Author has an email, which is the id, and a name, which is indexed.

![](img/46.png)

So when you want to create an Author, you call the constructor, just as you would for any plain old Java object: a new Author with the email xjin@bu.edu and the name xjin. You have your Author object, but at this point it exists only in memory. To save it, you call save, passing in the entity.

![](img/47.png)

ofy here is the Objectify library, one of several Java libraries that help you work with Datastore. So this code is showing you Objectify: we save the entity, and at this point the xjin object has been persisted.

![](img/48.png)

If you want to read it, if you want to search for it, you can say load all authors and filter them by the name Ha Jin. Because name is an indexed field, we can do this: we filter by the name Ha Jin, and we get back an iterable of authors.


Why iterable and not a list of authors? 


Well, because Datastore scales up to terabytes, what comes back from a search on one of these properties could be gigabytes of data, much more than can fit into memory, so we give you back an iterable. But if you know you're going to get back only one item, as in the second example here, where you're loading authors and looking up the id xjin@bu.edu, you're going to get exactly one author back, so you just get back the Author object.
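The difference between an iterable and a list can be sketched with a Python generator. The function below is a hypothetical stand-in for a query result, not a Datastore call; it counts how many records it has actually produced so the laziness is visible.

```python
from itertools import islice

produced = 0  # how many records the "query" has actually materialized

def load_authors_named(name, total):
    """Yield matching authors one at a time instead of building a full list."""
    global produced
    for i in range(total):
        produced += 1
        yield {"email": f"author{i}@example.com", "name": name}

# A billion potential matches, but nothing is produced until we iterate.
matches = load_authors_named("Ha Jin", 10**9)
first_three = list(islice(matches, 3))
```

Only the three records that were consumed ever exist in memory, which is why a result set far larger than RAM is not a problem for the caller.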

![](img/49.png)

We call it jh in the code, and now we can update the name of jh: you say jh.name = Jin Xuefei and then save that entity.


At this point the persisted object has the new name. And if you want to delete the entity, you just say delete entity jh.


![](img/50.png)


So create, read, update, delete: you can pretty much do everything that you do in a relational database, in a transactional way, using Datastore.
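The same create, read, update, delete sequence can be traced in a few lines of Python. This is only an analogy: a dictionary stands in for persistent storage, and `save`, `load`, and `delete` are hypothetical helpers, not the Objectify calls shown on the slides.

```python
authors = {}  # stands in for persistent storage, keyed by the email id

def save(author):
    # Persist a copy of the whole object under its id.
    authors[author["email"]] = dict(author)

def load(email):
    # Return a copy, so in-memory edits are not persisted until save().
    stored = authors.get(email)
    return dict(stored) if stored is not None else None

def delete(email):
    authors.pop(email, None)

xjin = {"email": "xjin@bu.edu", "name": "xjin"}
save(xjin)                                        # create
jh = load("xjin@bu.edu")                          # read by id
jh["name"] = "Jin Xuefei"                         # change exists only in memory
name_before_save = load("xjin@bu.edu")["name"]
save(jh)                                          # update
name_after_save = load("xjin@bu.edu")["name"]
delete("xjin@bu.edu")                             # delete
after_delete = load("xjin@bu.edu")
```

The point of the two name checks is the one made above: changing a field on the loaded object does nothing to the stored copy until you save the entity again.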

![](img/45.png)


![](img/51.png)

Now for another access pattern. Again, these are alternatives to using a relational database: you could use Datastore if you need transactional support for hierarchical data, something that relational databases don't handle very well.


Another reason that a relational database may not work very well, and we discussed this in the module review section of the previous chapter, is if you have high-throughput needs. If you have sensors distributed all across the world and you're getting back millions of messages a minute, that's not something that Cloud SQL, or any relational database, can handle very well. That's essentially an append-only operation: we're just getting data and saving it, and we don't need transactional support. And because we are willing to give up transactional support, the capacity of Bigtable is no longer the terabytes that Datastore can support, but petabytes. On the other hand, what we've given up is the ability to update just a single field of an object; we have to write an entirely new row. The idea is that when we get a new version of an object, we append it to the table, and then we read from the latest data and go backwards, so that the very first object we find at a particular key is the latest version of that object.
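The append-then-read-backwards idea can be sketched as follows. This is a toy in-memory log, not Bigtable's storage format, and it assumes records are appended in time order; the sensor names and readings are invented.

```python
log = []  # append-only list of (key, timestamp, value)

def append(key, timestamp, value):
    # Never update in place: every new version is a new record at the end.
    log.append((key, timestamp, value))

def read_latest(key):
    # Walk backwards from the newest record; the first hit is the latest version.
    for record_key, _timestamp, value in reversed(log):
        if record_key == key:
            return value
    return None

append("sensor-a", 100, {"temp": 20.5})
append("sensor-b", 101, {"temp": 18.0})
append("sensor-a", 102, {"temp": 21.0})  # a whole new row, not a field update

latest_a = read_latest("sensor-a")
```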



So Bigtable is really good for high-throughput scenarios where you don't want to be in the business of managing infrastructure and you want something as NoOps as possible. With Bigtable you deal with flattened data, not hierarchical data, and you search based only on the key. Because you can search only on the key, the key itself and the way you design it become extremely important. Number one, you want to think about the key as being the query that you're going to make.

![](img/52.png)

You cannot search fast based on any property, so you're going to be searching based on keys, and you want your key to be designed such that you can find the stuff that you want quickly. The key should also be designed such that there are no hotspots: you don't want all of your rows falling into the same bucket, you want them to be distributed. Tables themselves should be tall and narrow, right? Tall because you keep appending to them. Narrow, why? The idea is that if you have Boolean flags, for example, rather than having a column for each flag with a value of 0 or 1, maybe you just have one column that lists the flags that are true on this object. This kind of thing becomes extremely useful if you're trying to store, for example, users' ratings. A user may rate only five out of the thousands of items in your catalog. Rather than have thousands of columns, one for every item, you simply store an (item, rating) pair for each thing that they've actually rated, and each one could be a new column.
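One way to picture "the key is the query" is a table kept sorted by row key, where a prefix scan answers the question you designed the key for. The sketch below is a hypothetical in-memory model, not the Bigtable client. The key layout (sensor id first, then a reversed timestamp) and the `MAX_TS` bound are assumptions chosen so that writes spread across sensors and the newest row for a sensor sorts first.

```python
import bisect

MAX_TS = 10**10  # assumed upper bound on epoch seconds, used to reverse timestamps

def row_key(sensor_id, ts):
    # Lead with the sensor id so writes spread across sensors, then a
    # reversed, zero-padded timestamp so newer rows sort first.
    return f"{sensor_id}#{MAX_TS - ts:010d}"

class SortedTable:
    """Rows kept in key order; lookups are by key or key prefix only."""

    def __init__(self):
        self._keys = []
        self._rows = {}

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan_prefix(self, prefix, limit):
        # Jump to the first key >= prefix, then read forward while it matches.
        start = bisect.bisect_left(self._keys, prefix)
        out = []
        for key in self._keys[start:]:
            if not key.startswith(prefix) or len(out) >= limit:
                break
            out.append(self._rows[key])
        return out

table = SortedTable()
for sensor, ts in [("sensor-a", 100), ("sensor-b", 250),
                   ("sensor-a", 300), ("sensor-a", 200)]:
    table.put(row_key(sensor, ts), {"sensor": sensor, "ts": ts})

# "Latest two readings for sensor-a" becomes a cheap prefix scan.
latest_two = table.scan_prefix("sensor-a#", limit=2)
```

Had the key started with the timestamp instead, every new write would land next to the previous one, which is the hotspot the text warns about.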


Even though we said that your columns have to be flattened, there is this concept of a column family. Here, for example, MD is market data, so you have MD:symbol and MD:last sale. This is a way to group related columns together.
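A column family is essentially a naming convention of the form `family:qualifier`. As a small illustration, the helper below regroups a flat row by family. The `MD` family comes from the example above; the qualifier spellings, the second family `TD`, and all the values are made up for the sketch.

```python
from collections import defaultdict

def group_by_family(row):
    """Regroup flat 'family:qualifier' columns into one dict per family."""
    families = defaultdict(dict)
    for column, value in row.items():
        family, qualifier = column.split(":", 1)
        families[family][qualifier] = value
    return dict(families)

# A flat row: every column is still top level, only the names are grouped.
row = {
    "MD:SYMBOL": "XYZ",
    "MD:LASTSALE": 101.25,
    "TD:SIDE": "BUY",  # hypothetical second family
}
grouped = group_by_family(row)
```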


The reason to use Bigtable is that it's NoOps: it's automatically balanced, automatically replicated, and compacted. You don't have to manage any of that infrastructure, and you can deal with extremely high-throughput data.

![](img/53.png)

This is how you work with Bigtable: you use the HBase API, which is why what you're importing is org.apache.hadoop.hbase. You work with it the way you would normally work with HBase. You get a connection, from the connection you get your table, you create a Put operation, you add all of your columns to it, and then you put that into the table, and you've added a new row to the table.


If you're familiar with HBase, it's exactly the same.

# Interactive, Iterative Development & Datalab Demo

In this section we're still talking about transformational use cases. The next one we want to talk about is notebook development.

![](img/54.png)


As a data scientist, one of the major changes in the way I work with data came with Datalab.

![](img/55.png)

The way you work with Datalab is that you have a web page, your Datalab web page. In this web page you write code in Python, and you run that code either by hitting Shift+Enter or by clicking the Run button in the menu at the top. You can look at the output, go back, change the code, run the cell, change the code, run the cell again, and at some point, when you're satisfied with the way the code works, you can go in and write some commentary, right? The commentary can be in Markdown format: it can have titles, bold, emphasis, links, images, and all of those kinds of things. You write your commentary, your documentation, and then, because it's just a web page, you share it. Other people can come in, execute your notebook, make changes to your code, and rerun your notebook. It becomes an extremely good collaborative environment.


But one issue still exists: where does the web server run? Remember that the web page is only the client side; you still need a web server. If you run the Datalab, IPython, or Jupyter web server on your own laptop, then as soon as you close your laptop, nobody can access your notebook. So you really want to have this running on a cloud computer, and Datalab is a way to run it on GCP.

![](img/56.png)

To work with this, then, you have to think in terms of two things: where does the server run, and where does the client run? One option is to run both locally, and this is what you do for local development. You use your own CPU, you store your notebooks on disk, and you access your notebook at, for example, localhost:8081/, where 8081 is your port. Datalab comes in a Docker container, so you're running a Docker image locally. That's good as long as you are the only one who needs to access the notebook. The second option, which is good if multiple people are going to access a notebook, is to run the Docker container on Compute Engine.


So you get a Google Compute Engine instance up and going, and on that Compute Engine instance you run the Docker container. Then, whenever you need to connect to the notebook, you use an SSH tunnel, and Cloud Shell will let you do this. So you use an SSH tunnel from Cloud Shell to connect to this GCE instance, and then you work with the notebook inside your browser. But remember that all the code you run is getting executed on the Compute Engine instance.

A third way to do this is to again have it on a Compute Engine instance but access it through a gateway. You have a proxy set up, and this involves a little bit more work in setting up your browser so that you are not going through Cloud Shell. For the purposes of this class, the Cloud Shell approach is what we will use. But if you are a data scientist and you're going to be doing this a lot, you might want to explore the third option, because remember that Cloud Shell is an ephemeral VM. Every 60 minutes or so it gets recycled, and then you have to create the SSH tunnel again. It's pretty easy to create an SSH tunnel, it's just a single script that you run, but it's still something that you might want to avoid, because it can get pretty frustrating. So if you're going to be doing this a lot, and for more than 60 minutes at a time, you might want to look at the third option.

![](img/57.png)

But for the purposes of this class, we will use the second option because it's very simple and very easy to get going. We already have Cloud Shell, and it's easy to create a Compute Engine VM.


So let's go ahead and do that. That codelab, that link, gives you directions for doing all three of these things. So let's click on this link, and you will see that it takes you to option one, option two, or option three.

![](img/58.png)

Option one is from the gcloud SDK, option two is from Cloud Shell, and option three is from your local machine. These are actually reversed from the order in which we covered them, but this is the order of recommendation, if you will. So the first is the one that we recommend, the second is an okay option, and the third is the one where you have to be able to install software on your machine, and so on.


So this codelab is organized from the best to the not so best. In this class we'll use option number two, running Datalab from Cloud Shell. If you want, you can follow along with me; pause the video if necessary so you can catch up, and then we can go on. So here, what I'm going to be doing is, number one, open up Cloud Shell and, if necessary, clone the following Git repository. I already have the Git repo. I can copy this, and then we need to open up Cloud Shell. So how do we open Cloud Shell?

![](img/59.png)

The way you open up Cloud Shell is to go to console.cloud.google.com, which takes you to the GCP web console.

...

Just follow the steps in the guideline.

# Warehouse and Interactively Query Petabytes

![](img/60.png)

This is Arthur C. Clarke, and one of his famous quotes is that any sufficiently advanced technology is indistinguishable from magic. To me, that perfectly captures my feelings when I first encountered BigQuery. BigQuery is magical. But rather than me telling you what BigQuery is, let's see it in action. So I'm going to go to bigquery.cloud.google.com.

![](img/61.png)

And in the BigQuery console I'm going to compose a query and that's basically going to be my query.

![](img/62.png)

I'm going to be searching on 10 billion rows of the Wikipedia data. And for every row, I'm going to do a regular expression match on the title to see if it matches Google. And I'm going to see which language that mention of Google was in, and how many views there were of those mentions. So we're going to look for pages in different languages that mention Google in the title and count the number of views of those. And we'll go to the Show Options, and say don't cache the results. 

![](img/63.png)

BigQuery by default will cache the results for about a day, so that if you rerun the exact same query, you're not paying for it a second time. But we don't want to cheat, so we'll not cache any results. Let's go ahead and hide the options and run the query. And 1, 2, 3, 4 seconds, 5 seconds... 6.3 seconds later, 416 GB of data have been processed.

![](img/64.png)

And we find out that the most-viewed mentions of Google were in English, followed by Spanish. I mean, commons, which also shows up in the results, is obviously not a language. Then came German and Italian, followed by French, with Japanese at number 8 and Dutch at number 9, and so on. But the cool thing to realize is that in 6 seconds we were able to process nearly half a terabyte of data. This is awesome. This is magical.

![](img/65.png)

So, this is what we just did, we did a demo of BigQuery. This is a query. We looked at 10 billion rows of data. We did selection, then we grouped it, we ordered it and we got all that done in about 6 seconds.
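The logic of that query (select rows whose title matches a regular expression, group by language, sum the views, order by the total) can be traced on a handful of rows in plain Python. The five rows below are invented for illustration, and this is of course not how BigQuery executes the query at scale.

```python
import re
from collections import Counter

# (language, title, views): a tiny made-up stand-in for the Wikipedia table.
rows = [
    ("en", "Google", 100),
    ("es", "Google Maps", 40),
    ("en", "Yahoo", 70),
    ("de", "Google Chrome", 25),
    ("en", "Googleplex", 5),
]

pattern = re.compile(r"Google")

views_by_language = Counter()
for language, title, views in rows:
    if pattern.search(title):                  # selection: title mentions Google
        views_by_language[language] += views   # grouping and aggregation

ranked = views_by_language.most_common()       # ordering, largest total first
```

BigQuery runs the same three steps, selection, grouping, and ordering, but spread across many machines, which is where the 6 seconds comes from.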

![](img/66.png)

So what BigQuery gives you is an interactive way to analyze petabytes of data. The queries that you write can be standard SQL, SQL 2011, so it's a very familiar language to most people. You can also write User Defined Functions in JavaScript. And you can access data, CSV data, JSON data on Cloud Storage and run a query on them without actually ingesting them into BigQuery. 

![](img/67.png)

Ingesting data into BigQuery, though, is very straightforward. If you have your data on disk and it's small enough, or if you have your data uploaded to Google Cloud Storage or in Cloud Datastore, it's as easy as going to the web console and saying: here's my file, this is the destination, here is the schema, these are the columns, these are the types, load them in. It's very, very easy. And what you can do from the web console, you can also do from the command line using the bq command.

You can also stream data in with Cloud Dataflow. This is very helpful if, for example, you're receiving sensor data or log data in real time: you can process it with Cloud Dataflow and stream it into BigQuery. And even as the data are streaming in, you can run queries on that data.

Besides batch data on Cloud Storage or streaming data via Cloud Dataflow, a third option is to not load the data into BigQuery at all. You leave your data in raw form as CSV, JSON, or Avro files and set up what's called a federated data source. You're creating a link to it and saying, here is the name, here is the schema, there is my file, so don't make a copy of that file, query it directly.

That last option there, Google Sheets, is extremely interesting. The idea is that you can have some data in a Google Sheet and query it with BigQuery. And now you are saying, wait a second, a Google Sheet, what? I probably have like 30 rows in my Google Sheet, so why would I use a BigQuery SQL query on it? Well, it's not about the data in the Google Sheet. It's about what you can join it with. You could, for example, have millions or billions of rows of data in BigQuery and join them with the smaller data that you have in Google Sheets. So you may have your customer data in Google Sheets and your sales data in BigQuery, and you'll be able to correlate each customer with the much more massive data that you have in BigQuery. So being able to join a table in Sheets with a table in BigQuery is extremely powerful.
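The small-table-joined-with-large-table idea is ordinary SQL, so it can be tried out locally. The sketch below uses Python's built-in sqlite3 purely as a stand-in: `customers` plays the role of the Google Sheet and `sales` the role of the big BigQuery table. The table names, columns, and rows are all invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")  # the small "sheet"
conn.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")    # the big table

conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5), (3, 99.0)])

# Correlate each known customer with their sales.
totals = conn.execute(
    """
    SELECT c.name, SUM(s.amount)
    FROM sales AS s
    JOIN customers AS c ON s.customer_id = c.customer_id
    GROUP BY c.name
    ORDER BY c.name
    """
).fetchall()
conn.close()
```

The sale for customer 3 drops out of the result because that customer is not in the small table, which shows how the short list steers what you pull out of the large one.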

![](img/68.png)

You can also write user-defined functions in BigQuery, so you're not limited to SQL 2011. Here, for example, I'm calling a method called urlDecode. That method is defined as a user-defined function, and it's implemented in JavaScript. And you can have arbitrary JavaScript, so people have done much more complex things, like natural language processing. There's a JavaScript library that does natural language processing, so they're able to write a UDF to process, for example, Stack Overflow questions, right? That's free-form text, and you can now process it with an NLP API and run queries on it. This is a very powerful feature that takes BigQuery beyond just running SQL queries, because you can take JavaScript libraries, which can do a lot more than SQL, and combine them with the capabilities of your SQL programs.

You can work with BigQuery from the console, as I just showed you. But running it from Datalab gives you another level of flexibility, because Datalab is a way to run Python programs. The good thing about Python is that it has really nice data analysis and data visualization capabilities, and now you can combine BigQuery with these very powerful graphical tools and data science tools.

This is the way it works. You go to a cell and you mark that cell as a SQL cell. That tells Datalab that what follows is not Python code but BigQuery SQL. So you say this is SQL, and I'm going to give it a name, wxquery. Then in another cell, which is a Python cell, you create a query from wxquery, which links to that SQL: whenever I use wxquery, this is the SQL query that I want to run. I want to select the day of the year and all of those things, and you supply YEAR=2015. What this does is that wherever $YEAR occurs in the query, it gets replaced by 2015.

If we run this query, we get back a BigQuery result set, but we say take that BigQuery result set and convert it into a Pandas dataframe. Pandas is a Python data analysis library. So what comes back is a Pandas dataframe, and now Pandas, Matplotlib, NumPy, all of those capabilities that data scientists use in Python, are completely available on the result of a BigQuery query.
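The `$YEAR` substitution can be imitated with Python's standard `string.Template`. This is only an analogy for what Datalab does when you pass `YEAR=2015`; it is not the Datalab API, and the table and column names in the SQL text are placeholders.

```python
from string import Template

# A named SQL snippet with a $YEAR placeholder (table and columns are hypothetical).
wxquery = Template(
    "SELECT day_of_year, max_temperature "
    "FROM weather_table "
    "WHERE year = $YEAR"
)

# Supplying YEAR=2015 replaces every occurrence of $YEAR in the text.
sql_2015 = wxquery.substitute(YEAR=2015)
sql_2014 = wxquery.substitute(YEAR=2014)
```

Keeping the SQL as a named, parameterized template is what lets one cell define the query once while other cells rerun it for different years.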

# Lab: Create ML Dataset with BigQuery

# Machine learning with TensorFlow

# Lab: Carry Out ML with TensorFlow

# Fully build machine learning models & lab

# Lab: Machine Learning APIs