### 1.	Why would you want to use the Data API?


API stands for Application Programming Interface: an interface that software uses to access data, server software, or other applications. APIs have been around for quite some time. They are very versatile and can be used with web-based systems, operating systems, database systems, and computer hardware.

APIs are needed to bring applications together in order to perform a designed function built around sharing data and executing pre-defined processes. They work as the middle man, allowing developers to build new programmatic interactions between the various applications people and businesses use on a daily basis.

In the AWS context, the Data API (the Amazon RDS Data API) can be enabled only for Aurora Serverless DB clusters running specific Aurora MySQL and Aurora PostgreSQL versions. For more information, see Data API for Aurora Serverless.

### 2.	What are the benefits of splitting a large dataset into multiple files?


Splitting a shared database can help improve its performance and reduce the chance of database file corruption. After you split the database, you may decide to move the back-end database or to use a different one. You can use the Linked Table Manager to change the back-end database that you use.

Even in non-relational databases, you’d want to have logically different data in different “buckets” in most situations. Putting your user list in the same table as your game equipment properties wouldn’t make much sense, even if these are stored in Cassandra or MongoDB and you aren’t going to use joins.

Among other problems, even if you have a freeform name-value-pair storage scheme or a giant list of JSON documents, sticking everything in one database “table-thing” will impair search and lookup performance and make maintaining your data more tedious and difficult.

In a relational database, you almost never want generic “dumping ground” tables, as these will almost always cause problems with database normalization, performance, and ongoing database maintenance.

Occasionally you’ll have more “freeform” tables that may have a few property columns and one or more “blobby” VARCHARs or TEXT fields - common use-cases for these types of tables are log tables, user activity tables, or audit tables, but even these should be normalized and use other tables for things they routinely reference, such as users.

### 3.	During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?


#### Identifying the bottleneck

Several tools and techniques can help you evaluate the runtime performance of a training session and identify and study an input pipeline bottleneck. Three of them are:

1. System Metrics

2. Performance Profilers

3. Throughput Measurement
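As a rough illustration of throughput measurement, the sketch below times an input pipeline on its own and then together with a training step. It uses plain Python generators rather than a real training framework; `slow_pipeline` and `train_step` are hypothetical stand-ins, and the sleep durations are made-up numbers. If iterating over the data alone is barely faster than the full loop, the input pipeline is the likely bottleneck.

```python
import time


def slow_pipeline(n_batches, load_time=0.01):
    """Hypothetical input pipeline: each batch takes `load_time` seconds to produce."""
    for i in range(n_batches):
        time.sleep(load_time)  # stands in for reading and preprocessing a batch
        yield i


def train_step(batch, step_time=0.002):
    """Hypothetical training step that takes `step_time` seconds."""
    time.sleep(step_time)


def batches_per_second(batches, step=None):
    """Iterate over `batches`, optionally running `step` on each, and return throughput."""
    start = time.perf_counter()
    count = 0
    for batch in batches:
        if step is not None:
            step(batch)
        count += 1
    return count / (time.perf_counter() - start)


def input_bound_fraction(n_batches=20):
    """Approximate share of each full step that is spent waiting for input."""
    data_only = batches_per_second(slow_pipeline(n_batches))
    full_loop = batches_per_second(slow_pipeline(n_batches), step=train_step)
    # throughput ratio = (time per batch, data only) / (time per batch, full loop)
    return full_loop / data_only


if __name__ == "__main__":
    print(f"input pipeline accounts for roughly {input_bound_fraction():.0%} of step time")
```

A fraction close to 1 means the model spends most of each step waiting for data, so speeding up the model would not help; a small fraction means the pipeline is keeping up.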

There are four steps for addressing the data preprocessing bottleneck:

1. Identify any operations that can be moved to the data preparation phase

2. Optimize the data pre-processing pipeline

3. Perform some of the pre-processing steps on the GPU

4. Use the TensorFlow data service to offload some of the CPU compute to other machines

### 4.	Can you save any binary data to a TFRecord file, or only serialized protocol buffers?


The TFRecord format is a simple format for storing a sequence of binary records. Each record is just a byte string, so a TFRecord file can hold any binary data, not only serialized protocol buffers. Protocol buffers are a cross-platform, cross-language library for efficient serialization of structured data. Protocol messages are defined by .proto files, which are often the easiest way to understand a message type.

The High Resolution Rapid Refresh (HRRR) model is a numerical weather model. Because weather models work best when countries all over the world pool their observations, the format for weather data is decided by the World Meteorological Organization and it is super-hard to change. So, the HRRR data is disseminated in a #@!$@&= binary format called GRIB.

Regardless of the industry you are in (manufacturing, electricity generation, pharmaceutical research, genomics, astronomy), you probably have some format like this: a format that no modern software framework supports. Even though this example is about HRRR, the same techniques apply to any binary files you have.

The most efficient format for TensorFlow training is TensorFlow records. TFRecord files, which usually hold serialized protobuf messages, make it possible for the training program to buffer, prefetch, and parallelize the reading of records. So, a good first step for machine learning is to convert your industry-specific binary format files into TensorFlow records.

### 5.	Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?


Protocol buffers, or Protobuf, is a binary format created by Google to serialize data exchanged between different services. Google made it open source, and it now supports, out of the box, the most common languages, such as JavaScript, Java, C#, Ruby, and others.

When a message is encoded, the keys and values are concatenated into a byte stream. When the message is being decoded, the parser needs to be able to skip fields that it doesn't recognize. This way, new fields can be added to a message without breaking old programs that do not know about them.
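The skipping behaviour described above can be sketched with a tiny pure-Python reader for the protobuf wire format. This is a minimal illustration, not the official protobuf library: `read_varint` and `parse_known_fields` are hypothetical helper names, the sample `MESSAGE` bytes are hand-built, and deprecated group wire types are not handled. Each key is a varint holding `(field_number << 3) | wire_type`, and the wire type alone tells the parser how many bytes to step over.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at `pos`; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of this varint
            return result, pos
        shift += 7


def parse_known_fields(buf, known):
    """Return {field_number: raw_value} for fields in `known`, skipping all others."""
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        # The value is always consumed, but only kept if the field is recognized.
        if field_number in known:
            fields[field_number] = value
    return fields


# Hand-built message: field 1 = varint 150, field 2 = bytes b"hi", field 3 = varint 1.
MESSAGE = bytes([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69, 0x18, 0x01])

# An "old" program that only knows fields 1 and 2 still parses the message.
old_view = parse_known_fields(MESSAGE, known={1, 2})
```

Here `old_view` contains fields 1 and 2 only; field 3, added by a newer writer, is consumed and ignored without breaking the parse.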

The main problem with protobuf for large files is that it doesn't support random access. You have to read the whole file, even if you only want to access a specific item. If your application reads the whole file into memory anyway, this is not an issue.

### 6.	When using TFRecords, when would you want to activate compression? Why not do it systematically?


The TFRecord format is a simple format for storing a sequence of binary records. Converting your data into TFRecord has many advantages, such as:

* More efficient storage: the TFRecord data can take up less space than the original data; it can also be partitioned into multiple files.

* Fast I/O: the TFRecord format can be read with parallel I/O operations, which is useful for TPUs or multiple hosts.
* Self-contained files: the TFRecord data can be read from a single source—for example, the COCO2017 dataset originally stores data in two folders ("images" and "annotations").

An important use case of the TFRecord data format is training on TPUs. First, TPUs are fast enough to benefit from optimized I/O operations. In addition, TPUs require data to be stored remotely (e.g. on Google Cloud Storage) and using the TFRecord format makes it easier to load the data without batch-downloading.

Performance using the TFRecord format can be further improved if you also use it with the tf.data API.

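TensorFlow can both read and write compressed TFRecord files: compression is set through tf.io.TFRecordOptions when writing and through the compression_type argument of tf.data.TFRecordDataset when reading. Activating it is worthwhile when the files must be fetched over a network or when storage is tight, as uncompressed TFRecord is a rather inefficient format in terms of space. It is not done systematically because decompression costs CPU time during training, and data that is already compressed (JPEG images, for example) gains little from a second round of compression.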
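The trade-off behind compression can be seen with Python's standard-library `gzip` module (GZIP is one of the compression types TensorFlow accepts for TFRecord files). The sketch below is only an illustration with synthetic payloads, not a benchmark of TFRecord itself; `compression_ratio` is a hypothetical helper and both payloads are made up.

```python
import gzip
import os


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (below 1.0 means space was saved)."""
    return len(gzip.compress(payload)) / len(payload)


# Highly repetitive payload, standing in for text-like or sparse feature records.
repetitive = b"label=0;feature=0.0;" * 5_000

# Random bytes, standing in for data that is already compressed (e.g. JPEG images).
already_dense = os.urandom(100_000)

print(f"repetitive records:    {compression_ratio(repetitive):.3f}")
print(f"already-dense records: {compression_ratio(already_dense):.3f}")
```

The repetitive payload shrinks dramatically, while the random payload does not shrink at all, yet both would still pay the CPU cost of decompression on every read.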

### 7.	Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

Data preprocessing is carried out to clean up unformatted real-world data. Consider first how missing data can be handled during data preparation. There are three common approaches:

* Ignoring the missing record - This is the simplest and most efficient method for handling missing data. However, it should not be used when the number of missing values is large, or when the pattern of missing data is related to the underlying cause of the problem being studied.

* Filling the missing values manually - This is one of the most commonly chosen methods in the data preparation process. Its limitation is that when the dataset is large and many values are missing, it becomes too time-consuming to be efficient.

* Filling using computed values - The missing values can also be filled in with the mean, mode, or median of the observed values. Another option is to predict them using machine learning or deep learning algorithms. One drawback of this approach is that it can introduce bias into the data, because the computed values are not the true observed values.
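The first and third approaches above can be sketched in a few lines of standard-library Python. This is a minimal illustration on a made-up `ages` column in which `None` marks a missing value; `drop_missing` and `fill_missing` are hypothetical helper names, not functions from any preprocessing library.

```python
from statistics import mean, median, mode


def drop_missing(values):
    """Ignore missing records: keep only the observed values."""
    return [v for v in values if v is not None]


def fill_missing(values, strategy=mean):
    """Fill each missing slot with a statistic computed from the observed values."""
    fill = strategy(drop_missing(values))
    return [fill if v is None else v for v in values]


# Made-up column with two missing entries.
ages = [22, None, 30, 30, None, 58]

dropped = drop_missing(ages)
filled_mean = fill_missing(ages, mean)
filled_median = fill_missing(ages, median)
filled_mode = fill_missing(ages, mode)
```

Every missing slot receives the same computed value, which shrinks the spread of the column; this is one concrete form of the bias mentioned in the last bullet.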