# Prepare Dataset for Training Using TensorFlow

In this guide, we will prepare a dataset for training using TensorFlow. We will perform several key steps to get the data ready for training a machine learning model.

## 1. Create the Dataset

First, we will create a dataset with a range of 10 elements using TensorFlow's `tf.data.Dataset.range(10)`.

## 2. Windowing and Shifting

Next, we will apply the windowing technique to our dataset. We can define the size of each window and the shift between consecutive windows. We'll explore the impact of setting `drop_remainder` to `True`.

- **Window Size**: We will specify the size of each window, which determines how many elements are grouped together in each window.

- **Shift**: The shift parameter defines how far the window moves forward after each window is created. A shift of 1 means each window starts one element after the previous one, so consecutive windows of size 5 share four elements.

- **drop_remainder=True**: Setting this parameter to `True` will drop any incomplete windows at the end of the dataset if there are fewer elements than the specified window size.
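To see how `size`, `shift`, and `drop_remainder` interact, the sketch below lists the windows produced with a shift of 2. The helper name `window_lists` is hypothetical and exists only for this illustration. Consecutive complete windows share `size - shift` elements.

```python
import tensorflow as tf

def window_lists(n, size, shift, drop_remainder):
    """Return the windows of range(n) as plain Python lists (illustrative helper)."""
    ds = tf.data.Dataset.range(n).window(
        size, shift=shift, drop_remainder=drop_remainder
    )
    # Each window is itself a small dataset, so iterate it to read its elements.
    return [[int(item.numpy()) for item in window] for window in ds]

complete = window_lists(10, size=5, shift=2, drop_remainder=True)
with_partial = window_lists(10, size=5, shift=2, drop_remainder=False)

print(complete)      # only full windows of 5 elements
print(with_partial)  # also includes the shorter windows at the end

# Consecutive complete windows share size - shift = 3 elements.
overlap = set(complete[0]) & set(complete[1])
print(sorted(overlap))
```

With `drop_remainder=False`, the trailing windows get shorter as the data runs out, which usually has to be handled before the windows can be stacked into fixed-shape tensors.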

## 3. Flattening the Dataset

To further process the data, we will flatten the dataset of windows. Each window produced by `window` is itself a nested dataset rather than a tensor. We'll use the `flat_map` function with a lambda that batches each window (`window.batch(5)`), so every window becomes a single tensor of five elements, and the results are concatenated into one flat dataset that is easier to handle.
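The difference between batching inside `flat_map` and returning the window unchanged can be sketched as follows. The variable names are illustrative. The second variant is not used in this guide; it is shown only to make clear what the inner `batch` call contributes.

```python
import tensorflow as tf

windows = tf.data.Dataset.range(10).window(5, shift=1, drop_remainder=True)

# Batching inside flat_map: each window becomes one tensor of shape (5,).
as_tensors = windows.flat_map(lambda window: window.batch(5))
tensor_rows = [row.numpy().tolist() for row in as_tensors]

# Returning the window unchanged: elements come out one scalar at a time.
as_scalars = windows.flat_map(lambda window: window)
scalars = [int(x.numpy()) for x in as_scalars]

print(len(tensor_rows), tensor_rows[0])
print(len(scalars), scalars[:7])
```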

## 4. Feature Engineering and Label Creation

After flattening the dataset, we can use the `map` function to apply transformations to the data. This is often used for feature engineering and label creation. You can define custom functions to modify or extract features from the data.
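As one possible way to write the transformation, the sketch below uses a named function instead of a lambda. The name `split_features_label` is hypothetical. Treating the last value of each window as the label is the convention assumed here.

```python
import tensorflow as tf

def split_features_label(window):
    # All but the last value are features; the last value is the label.
    return window[:-1], window[-1]

dataset = tf.data.Dataset.range(10).window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(split_features_label)

pairs = [(x.numpy().tolist(), int(y.numpy())) for x, y in dataset]
for features, label in pairs:
    print(features, "->", label)
```

A named function is easier to test and reuse than an inline lambda once the transformation grows beyond a simple slice.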

## 5. Data Shuffling

Shuffling the data is a good practice to reduce sequence bias when training a model. We'll shuffle the dataset to ensure that the order of examples doesn't affect the training process.
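How thoroughly the data is mixed depends on `buffer_size`. The minimal sketch below compares two extremes on a small range. The `seed` value is arbitrary and only makes the run repeatable.

```python
import tensorflow as tf

base = tf.data.Dataset.range(6)

# A buffer at least as large as the dataset allows any ordering.
shuffled = [int(x.numpy()) for x in base.shuffle(buffer_size=6, seed=42)]

# A buffer of 1 holds a single element at a time, so the order is unchanged.
unshuffled = [int(x.numpy()) for x in base.shuffle(buffer_size=1)]

print(shuffled)    # some permutation of 0..5
print(unshuffled)  # original order
```

For larger datasets, a buffer smaller than the dataset still shuffles, but only within a sliding buffer of that size.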

## 6. Batching

To train a model, we'll create batches of data. Batching groups several examples together into a single batch, which is more efficient for model training.
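The effect of batching on tensor shapes can be checked directly. In this sketch the batch size of 4 is an arbitrary choice, picked so that the six windows do not divide evenly.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))  # 6 windows of 5 values

# The last batch is smaller because 6 windows do not divide evenly by 4.
shapes = [batch.numpy().shape for batch in dataset.batch(4)]

# drop_remainder=True keeps only full batches, so every batch has the same shape.
full_shapes = [
    batch.numpy().shape for batch in dataset.batch(4, drop_remainder=True)
]

print(shapes)
print(full_shapes)
```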

## 7. Prefetching

Finally, we'll use the `prefetch` function to prefetch data for the next batch. Prefetching helps in reducing training time by overlapping the data loading and model training phases.
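A fixed buffer size is one option. The sketch below also shows `tf.data.AUTOTUNE`, which is assumed to be available (TensorFlow 2.4 or later) and lets the runtime choose the buffer size. Either way, prefetching changes when batches are prepared, not what they contain.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).batch(2)

# Fixed buffer: keep one batch ready while the current one is consumed.
fixed = dataset.prefetch(1)

# Assumption: tf.data.AUTOTUNE exists in the installed TensorFlow (2.4+).
tuned = dataset.prefetch(tf.data.AUTOTUNE)

fixed_batches = [b.numpy().tolist() for b in fixed]
tuned_batches = [b.numpy().tolist() for b in tuned]

print(fixed_batches)
print(fixed_batches == tuned_batches)
```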

By following these steps, we'll have a well-prepared dataset ready for training machine learning models in TensorFlow.


In [1]:
import tensorflow as tf

In [11]:
# Generate a dataset with 10 elements
dataset = tf.data.Dataset.range(10)

In [12]:
dataset

<RangeDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

In [13]:
for val in dataset:
    print(val.numpy())

0
1
2
3
4
5
6
7
8
9


# Windowing the data

In [14]:
dataset = dataset.window(size=5, shift=1)

In [17]:
for window_data in dataset:
    print([item.numpy() for item in window_data])



[0, 1, 2, 3, 4]
[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]
[3, 4, 5, 6, 7]
[4, 5, 6, 7, 8]
[5, 6, 7, 8, 9]
[6, 7, 8, 9]
[7, 8, 9]
[8, 9]
[9]


In [45]:
# Same output as the list comprehension, written as an explicit loop
for window_data in dataset:
    values = []
    for item in window_data:
        values.append(item.numpy())
    print(values)



[0, 1, 2, 3, 4]
[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]
[3, 4, 5, 6, 7]
[4, 5, 6, 7, 8]
[5, 6, 7, 8, 9]
[6, 7, 8, 9]
[7, 8, 9]
[8, 9]
[9]


In [50]:
# To keep only complete windows of 5 elements, use drop_remainder=True
dataset = tf.data.Dataset.range(10)

In [51]:
dataset = dataset.window(size=5, shift=1, drop_remainder=True)

In [59]:
for window_data in dataset:
    print([item.numpy() for item in window_data])

[0, 1, 2, 3, 4]
[1, 2, 3, 4, 5]
[2, 3, 4, 5, 6]
[3, 4, 5, 6, 7]
[4, 5, 6, 7, 8]
[5, 6, 7, 8, 9]

In [53]:
# Flatten the dataset: batch each window into a single tensor of 5 elements

dataset = dataset.flat_map(lambda window: window.batch(5))

In [60]:
for window_data in dataset:
    print(window_data.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
[5 6 7 8 9]


In [76]:
# Group each window into features and a label

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
# The first four values are the features; the last value is the label
dataset = dataset.map(lambda mywindow: (mywindow[:-1], mywindow[-1]))


# Shuffle the data; it is good practice to shuffle to reduce sequence bias
dataset = dataset.shuffle(buffer_size=10)



# Create batches of windows
# With a prefetch buffer size of 1, TensorFlow prepares the next batch in advance
dataset = dataset.batch(2).prefetch(1)

# Print the results
for x, y in dataset:
    print("x = ", x.numpy())
    print("y = ", y.numpy())
    print()

x =  [[2 3 4 5]
 [3 4 5 6]]
y =  [6 7]

x =  [[1 2 3 4]
 [5 6 7 8]]
y =  [5 9]

x =  [[4 5 6 7]
 [0 1 2 3]]
y =  [8 4]
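
The steps above can be collected into one reusable function. This is a minimal sketch, not a definitive implementation. The function name `windowed_dataset` and its parameters are hypothetical, and it assumes the input is a 1-D sequence loaded with `tf.data.Dataset.from_tensor_slices`, where the notebook uses `Dataset.range`.

```python
import tensorflow as tf

def windowed_dataset(values, window_size, batch_size, shuffle_buffer):
    """Build shuffled (features, label) batches from a 1-D sequence."""
    dataset = tf.data.Dataset.from_tensor_slices(values)
    # Each window holds window_size features plus one label.
    dataset = dataset.window(window_size + 1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(lambda window: window.batch(window_size + 1))
    dataset = dataset.map(lambda window: (window[:-1], window[-1]))
    dataset = dataset.shuffle(shuffle_buffer)
    # Prepare one batch ahead while the current batch is being consumed.
    return dataset.batch(batch_size).prefetch(1)

batches = list(
    windowed_dataset(list(range(10)), window_size=4, batch_size=2, shuffle_buffer=10)
)
for x, y in batches:
    print("x =", x.numpy())
    print("y =", y.numpy())
```

The order of the batches varies between runs because of the shuffle, but each label is always the value that follows its row of features.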

