# The Conundrum : 

When building machine learning models, we tend to focus so heavily on the more glamorous aspects, such as finding the best architecture and tuning hyperparameters, that we fail to recognize that our dataset is the lifeblood of the entire pipeline.

Current hardware for ML, such as TPUs and GPUs, is built with several hundred cores that allow it to massively parallelize machine learning workloads. This is a great development, but to take advantage of that parallelism we need tools that can feed data fast enough to keep such high compute capabilities fully utilized.

![image.png](attachment:image.png) 

In [1]:
%%capture
import os
# List every file under the input directory (output is hidden by %%capture).
for dirname, _, filenames in os.walk("../input"):
  for filename in filenames:
    print(os.path.join(dirname, filename))

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import tensorflow as tf
%matplotlib inline
import cv2
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

# Data Reading

In [3]:
# pd.read_csv already returns a DataFrame, so no extra wrapping is needed.
train_data = pd.read_csv("../input/plant-pathology-2020-fgvc7/train.csv")
test_data = pd.read_csv("../input/plant-pathology-2020-fgvc7/test.csv")

In [4]:
train_data.head()

Unnamed: 0,image_id,healthy,multiple_diseases,rust,scab
0,Train_0,0,0,0,1
1,Train_1,0,1,0,0
2,Train_2,1,0,0,0
3,Train_3,0,0,1,0
4,Train_4,1,0,0,0


In [5]:
test_data.head()

Unnamed: 0,image_id
0,Test_0
1,Test_1
2,Test_2
3,Test_3
4,Test_4


In [6]:
print("Training data shape : = {}".format(train_data.shape))
print("Test data shape : = {}".format(test_data.shape))

Training data shape : = (1821, 5)
Test data shape : = (1821, 1)


In [7]:
image_folder_path = "../input/plant-pathology-2020-fgvc7/images/"

In [8]:
# Collect the image ids as plain Python lists.
train_images = train_data["image_id"].tolist()
test_images = test_data["image_id"].tolist()

In [9]:
print(train_images[:5])
print(test_images[:5])

['Train_0', 'Train_1', 'Train_2', 'Train_3', 'Train_4']
['Test_0', 'Test_1', 'Test_2', 'Test_3', 'Test_4']


In [10]:
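def load_image(image_id):
  # Build the full path, e.g. ".../images/Train_0.jpg".
  image_path = image_folder_path + image_id + ".jpg"
  image = cv2.imread(image_path)
  # cv2.imread returns None instead of raising when the file cannot be read.
  if image is None:
    raise FileNotFoundError(image_path)
  # OpenCV loads images in BGR order; convert to RGB.
  return cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

def resize(image):
  # Resize to a fixed 800x800 resolution.
  return cv2.resize(image, (800, 800))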

# Dataset Creation Techniques : 

Having efficient data pipelines is of paramount importance for any machine learning model.

## AVOID THIS !!

![image.png](attachment:image.png)

## Go for this : 

![image.png](attachment:image.png)

TensorFlow’s Dataset module **tf.data** can be used to build efficient data pipelines. **tf.data** improves performance by prefetching the next batch of data asynchronously, so that the GPU does not have to wait for data. We can also parallelize the loading and preprocessing of the dataset.
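As a minimal sketch of the prefetching and parallel preprocessing described above, the cell below builds a small pipeline on random stand-in data. The `images`, `labels`, and `preprocess` names are hypothetical and not part of this notebook's dataset, and `tf.data.AUTOTUNE` assumes TensorFlow 2.4 or newer (older releases expose it as `tf.data.experimental.AUTOTUNE`).

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in data: 8 small random "images" and integer labels.
images = np.random.rand(8, 32, 32, 3).astype("float32")
labels = np.arange(8, dtype="int64")

def preprocess(image, label):
  # Illustrative transform only: resize each image to 16x16.
  image = tf.image.resize(image, (16, 16))
  return image, label

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    # Preprocess several elements in parallel; AUTOTUNE lets tf.data
    # pick the level of parallelism at runtime.
    .map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4)
    # Prepare the next batch in the background while the current one is consumed.
    .prefetch(tf.data.AUTOTUNE)
)

batches = list(dataset.as_numpy_iterator())
```

Placing `prefetch` last is a common choice, since it overlaps everything upstream (loading, mapping, batching) with the training step that consumes the batch.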



# Operations On Dataset : 

* `batch:` Combines consecutive elements of the Dataset into a single batch. Useful when you want to train on smaller batches of data to avoid out-of-memory errors.

* `zip:` Creates a Dataset by zipping together several datasets. Useful when features and labels are stored separately and the model needs (feature, label) pairs for training.

* `map:` Applies a function to every element of the Dataset. Useful when you want to transform your raw data before feeding it into the model.
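The three operations above can be sketched on a toy dataset as follows. The feature and label values are made up for illustration only.

```python
import tensorflow as tf

# Toy features and labels, stored as two separate datasets.
features = tf.data.Dataset.from_tensor_slices([1.0, 2.0, 3.0, 4.0, 5.0])
labels = tf.data.Dataset.from_tensor_slices([0, 1, 0, 1, 0])

# zip: pair each feature with its label.
pairs = tf.data.Dataset.zip((features, labels))

# map: transform every element (here, double the feature value).
scaled = pairs.map(lambda x, y: (x * 2.0, y))

# batch: group consecutive elements; the last batch may be smaller.
batched = scaled.batch(2)

result = [(x.tolist(), y.tolist()) for x, y in batched.as_numpy_iterator()]
print(result)
```

With five elements and a batch size of 2, the final batch holds a single element; passing `drop_remainder=True` to `batch` would discard it instead.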

# tf.data API : 

The tf.data API was built with three major focal areas. These are:
* Performance
* Flexibility
* Ease of use

tf.data provides us with the tools to squeeze out every bit of performance from our hardware accelerators such as GPUs and TPUs. On the flexibility side, tf.data widens the spectrum of options in terms of the kinds of data we want to use, without having to rely on any external tools. 

Some common sources of data supported by tf.data are Python lists, NumPy arrays, TFRecords, CSV files, and text files; image formats such as JPG and PNG can be decoded inside the pipeline with `tf.io` and `tf.image`. This means that extra tools such as Pandas are not strictly required to build our data pipeline, since most of that functionality is available within TensorFlow itself.
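To illustrate reading image files without any external library, here is a small sketch that decodes a JPG inside a tf.data pipeline. So that it runs on its own, it first writes a tiny placeholder image to a temporary folder; the file name `sample.jpg` and the `decode` helper are hypothetical, and the 4x4 target size is arbitrary.

```python
import os
import tempfile
import tensorflow as tf

# Write one tiny all-black JPEG to a temporary folder (placeholder data).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.jpg")
pixels = tf.zeros((8, 8, 3), dtype=tf.uint8)
tf.io.write_file(path, tf.io.encode_jpeg(pixels))

def decode(file_path):
  # Read the raw bytes, decode them as an RGB image, then resize.
  raw = tf.io.read_file(file_path)
  image = tf.io.decode_jpeg(raw, channels=3)
  return tf.image.resize(image, (4, 4))

# A dataset of file paths, mapped to decoded image tensors.
dataset = tf.data.Dataset.from_tensor_slices([path]).map(decode)
image_shapes = [img.shape for img in dataset.as_numpy_iterator()]
```

The same pattern should carry over to a list of real image paths, since `decode` only depends on the path string it receives.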

tf.data provides a plethora of methods to create datasets from NumPy arrays, CSV files, tensors, and so on. Some of these methods are listed below:

* `from_tensor_slices:` Accepts one or more NumPy arrays or tensors and slices them along the first dimension. A Dataset created this way emits one slice (for example, one row or one sample) at a time.

* `from_tensors:` Also accepts one or more NumPy arrays or tensors, but does not slice them. A Dataset created this way contains a single element that holds all the data at once.

* `from_generator:` Creates a Dataset whose elements are generated by a function.
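The difference between these three constructors is easiest to see on a small array. The sketch below uses made-up data and a hypothetical `count_up` generator; the `output_signature` argument assumes TensorFlow 2.4 or newer.

```python
import numpy as np
import tensorflow as tf

data = np.array([[1, 2], [3, 4], [5, 6]])

# from_tensor_slices: slices along the first axis, giving 3 elements of shape (2,).
sliced = tf.data.Dataset.from_tensor_slices(data)

# from_tensors: the whole array becomes a single element of shape (3, 2).
whole = tf.data.Dataset.from_tensors(data)

def count_up():
  # Hypothetical generator yielding three integers.
  for i in range(3):
    yield i

# from_generator: elements come from a Python generator; output_signature
# declares the dtype and shape of each yielded element.
generated = tf.data.Dataset.from_generator(
    count_up,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

sliced_shapes = [x.shape for x in sliced.as_numpy_iterator()]
whole_shapes = [x.shape for x in whole.as_numpy_iterator()]
generated_values = [int(x) for x in generated.as_numpy_iterator()]
```

For a training set of samples, `from_tensor_slices` is usually the one to reach for, because each emitted element then corresponds to one sample.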