many data-preprocessing and feature-engineering techniques are domain specific (e.g., specific to text data or image data)

##### Data Preprocessing for Neural Networks

making the raw data more amenable to neural networks: vectorization, normalization, handling missing values and feature extraction

VECTORIZATION:
- all inputs and targets must be tensors of floating-point data (or in specific cases, tensors of integers)
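
as an illustration of vectorization, the sketch below multi-hot encodes lists of integer word indices into a float32 matrix; the function name `vectorize_sequences` and the toy data are assumptions made for this example, not something defined in these notes:

```python
import numpy as np

def vectorize_sequences(sequences, dimension):
    """Multi-hot encode lists of integer indices into a float32 matrix."""
    results = np.zeros((len(sequences), dimension), dtype="float32")
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.0  # set the positions named by the indices to 1
    return results

# toy input: two "documents" given as word indices (hypothetical data)
sequences = [[0, 3], [1, 2, 3]]
x = vectorize_sequences(sequences, dimension=5)
```

each row of `x` is now a fixed-length float vector, which is the kind of tensor a Dense layer can consume.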

VALUE NORMALIZATION:
- in general, it isn't safe to feed a neural network data that takes relatively large values
    - for example, multidigit integers, which are much larger than the initial values taken by the weights of a network
    - or data that is heterogeneous (for example, data where one feature is in the range 0-1 and another is in the range 100-200)
- doing so can trigger large gradient updates that will prevent the network from converging
- the data should have the following characteristics:
    - take small values -- typically in the 0-1 range
    - be homogeneous -- all features should take values in roughly the same range
    - common stricter normalization:
        - normalize each feature independently to have a mean of 0
        - normalize each feature independently to have a standard deviation of 1
        - this is easy to do with numpy arrays:

In [None]:
# assuming x is a 2D data matrix of shape (samples, features)
x -= x.mean(axis=0)
x /= x.std(axis=0)
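
one detail that is easy to miss: the mean and standard deviation are normally computed on the training data only and then reused for the test data; the sketch below shows this under that assumption, with hypothetical arrays `x_train` and `x_test`:

```python
import numpy as np

# hypothetical data: one feature near 0-1, another in the range 100-200
x_train = np.array([[0.1, 100.0], [0.5, 150.0], [0.9, 200.0]])
x_test = np.array([[0.3, 120.0]])

mean = x_train.mean(axis=0)  # per-feature mean, from training data only
std = x_train.std(axis=0)    # per-feature standard deviation

x_train_norm = (x_train - mean) / std
x_test_norm = (x_test - mean) / std  # reuse the training statistics
```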

HANDLING MISSING VALUES:

- in general, with neural networks, it's safe to input missing values as 0
    - with the condition that 0 isn't already a meaningful value
    - the network will learn from exposure to the data that value 0 means 'missing data' and will start ignoring the value

note:
- if missing values are expected in the test data, but the network was trained on data without any missing values, the network won't have learned to ignore missing values
- in that case, one should artificially generate training samples with missing entries (copy some training samples and drop some of the features expected to be missing in the test data)
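
a minimal sketch of generating such samples, assuming missing values are encoded as 0; the helper name `drop_entries`, the 25% rate, and the toy data are hypothetical choices for illustration:

```python
import numpy as np

def drop_entries(x, rate, rng):
    """Return a copy of x with roughly `rate` of its entries set to 0."""
    mask = rng.random(x.shape) >= rate  # True where the entry is kept
    return x * mask

rng = np.random.default_rng(0)
x_train = np.ones((4, 3), dtype="float32")  # hypothetical training samples
x_missing = drop_entries(x_train, rate=0.25, rng=rng)

# train on the original samples plus the copies with missing entries
x_augmented = np.concatenate([x_train, x_missing], axis=0)
```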

##### Feature Engineering

feature engineering makes the algorithm work better by applying hardcoded (non-learned) transformations to the data before it goes into the model

![Feature_Engineering_for_Reading_the_Time_on_a_Clock](./f_eng_time.png)

- before deep learning, feature engineering used to be critical
    - classical shallow algorithms didn't have hypothesis spaces rich enough to learn useful features by themselves
    - for example, before the invention of convolutional neural networks

modern deep learning removes the need for most feature engineering, because neural networks are capable of automatically extracting useful features from raw data; still, note that:

- good features allow you to solve problems more elegantly while using fewer resources
- good features let you solve a problem with far less data
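
the clock-reading figure above can be made concrete with a small sketch: instead of raw pixels, the hand-tip coordinates are turned into angles, after which reading the time is simple arithmetic; the function names and the coordinate convention (origin at the clock center, y pointing up) are assumptions for this example:

```python
import math

def hand_angle(x, y):
    """Clockwise angle in degrees from 12 o'clock for a hand tip at (x, y)."""
    return math.degrees(math.atan2(x, y)) % 360

def read_time(hour_tip, minute_tip):
    """Convert the two hand-tip coordinates into (hour, minute)."""
    hour = int(hand_angle(*hour_tip) // 30) % 12            # 30 degrees per hour
    minute = int(round(hand_angle(*minute_tip) / 6)) % 60   # 6 degrees per minute
    return hour, minute

# hypothetical hour hand at 100 degrees, minute hand pointing straight down
hour_tip = (math.sin(math.radians(100)), math.cos(math.radians(100)))
time = read_time(hour_tip, (0.0, -1.0))
```

with features this good, no machine learning is needed at all, which is the point of the example.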

##### Overfitting and Underfitting

when training the model, the performance on the held-out validation data always peaked after a few epochs and then began to degrade: the model quickly started to overfit to the training data

the fundamental issue in machine learning is the tension between optimization and generalization:

- 'optimization' refers to the process of adjusting a model to get the best performance possible on the training data
- 'generalization' refers to how well the trained model performs on data it has never seen before

at the beginning of training, optimization and generalization are correlated: the lower the loss on the training data, the lower the loss on the test data -- the model is said to be 'underfit'

after a certain number of iterations on the training data, generalization stops improving, and validation metrics stall and then begin to degrade -- the model is starting to 'overfit' (beginning to learn patterns that are specific to the training data but misleading or irrelevant when it comes to new data)
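
one simple way to locate the turning point is to look for the epoch with the lowest validation loss; the sketch below uses a hypothetical list of per-epoch losses (with Keras, such a list would typically come from `history.history['val_loss']` after calling `fit` with validation data):

```python
def best_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1

# hypothetical validation losses: improving, stalling, then degrading
val_losses = [0.60, 0.42, 0.35, 0.33, 0.36, 0.41, 0.48]
epoch = best_epoch(val_losses)
```

epochs before `epoch` are in the underfit regime; later epochs are where overfitting takes over.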

'the best solution is to get more training data' -- a model trained on more data will naturally generalize better

'the next-best solution is to modulate the quantity of information that the model is allowed to store or to add constraints on what information it's allowed to store'

the process of fighting overfitting this way is called 'regularization'

##### Reducing the Network's Size

the simplest way to prevent overfitting is to reduce the size of the model: the number of learnable parameters in the model (which is determined by the number of layers and the number of units per layer)

in deep learning, the number of learnable parameters in a model is often referred to as the model's 'capacity'; there is a compromise to be found between 'too much capacity' and 'not enough capacity'

there is no magical formula to determine the right number of layers or the right size for each layer -- evaluate an array of different architectures (on the validation set, not on the test set)

the general workflow to find an appropriate model size is to start with relatively few layers and parameters, and increase the size of the layers or add new layers until you see diminishing returns with regard to validation loss

the movie-review classification network:

In [3]:
# original model

from keras import models
from keras import layers

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

In [4]:
# version of the model with lower capacity

model = models.Sequential()
model.add(layers.Dense(4, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(4, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

![Effect of Model Capacity on Validation Loss: Trying a Smaller Model](./m_cpcty_val_s.png)

In [7]:
# version of the model with higher capacity

model = models.Sequential()
model.add(layers.Dense(512, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(512, activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

![Effect of Model Capacity on Validation Loss: Trying a Bigger Model](./m_cpcty_val_b.png)

- the bigger network gets its training loss near zero very quickly
- the more capacity the network has, the more quickly it can model the training data (resulting in a low training loss), but the more susceptible it is to overfitting (resulting in a large difference between the training and validation loss)

![Effect of Model Capacity on Training Loss: Trying a Bigger Model](./m_cpcty_train_b.png)
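
to make 'capacity' concrete, the parameter counts of the three networks above can be worked out by hand: each Dense layer has one weight per (input, unit) pair plus one bias per unit; the helper name `dense_param_count` is made up for this sketch:

```python
def dense_param_count(input_dim, layer_units):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for units in layer_units:
        total += input_dim * units + units  # weight matrix plus bias vector
        input_dim = units                   # this layer feeds the next one
    return total

original = dense_param_count(10000, [16, 16, 1])   # 160,305 parameters
smaller = dense_param_count(10000, [4, 4, 1])      # 40,029 parameters
bigger = dense_param_count(10000, [512, 512, 1])   # 5,383,681 parameters
```

almost all of the parameters sit in the first layer, because of the 10,000-dimensional input.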

##### Adding Weight Regularization

- Occam's Razor: 
    - given two explanations for something, the explanation most likely to be correct is the simplest one -- the one that makes fewer assumptions

simpler models are less likely to overfit than complex ones

- weight regularization: (adding to the loss function of the network a cost associated with having large weights)
    - a common way to mitigate overfitting is to put constraints on the complexity of a network by forcing its weights to take only small values, which makes the distribution of weight values more regular

- L1 regularization: (L1 norm of the weights)
    - the cost added is proportional to the absolute value of the weight coefficients 
- L2 regularization: (L2 norm of the weights) (weight decay)
    - the cost added is proportional to the square of the value of the weight coefficients
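
the two penalties can be written out in a few lines of numpy; this is a sketch of the arithmetic only, with hypothetical function names and a made-up weight matrix, not the Keras implementation:

```python
import numpy as np

def l1_penalty(weights, factor):
    """Cost proportional to the sum of absolute weight values."""
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor):
    """Cost proportional to the sum of squared weight values."""
    return factor * np.sum(np.square(weights))

weights = np.array([[0.5, -2.0], [0.0, 1.0]])  # hypothetical weight matrix
l1 = l1_penalty(weights, 0.001)  # 0.001 * (0.5 + 2.0 + 0.0 + 1.0)
l2 = l2_penalty(weights, 0.001)  # 0.001 * (0.25 + 4.0 + 0.0 + 1.0)
```

the L2 penalty punishes the single large weight (-2.0) much more heavily than L1 does, since squaring it gives 4.0.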

L2 weight regularization to the movie-review classification network:

In [11]:
# adding L2 weight regularization to the model

from keras import regularizers 

model = models.Sequential()
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu', input_shape=(10000,)))
model.add(layers.Dense(16, kernel_regularizer=regularizers.l2(0.001),
                       activation='relu'))
model.add(layers.Dense(1, activation='sigmoid'))

l2(0.001) means every coefficient in the weight matrix of the layer will add 0.001 * weight_coefficient_value ** 2 to the total loss of the network (note that because this penalty is only added at training time, the loss for this network will be much higher at training than at test time)

![Effect of L2 Weight Regularization on Validation Loss](./l2_reg_val.png)

In [None]:
# different weight regularizers available in keras

from keras import regularizers

regularizers.l1(0.001)

regularizers.l1_l2(l1=0.001, l2=0.001)

##### Adding Dropout

dropout is one of the most effective and most commonly used regularization techniques for neural networks

dropout, applied to a layer, consists of randomly 'dropping out' (setting to zero) a number of output features of the layer during training

the 'dropout rate' is the fraction of the features that are zeroed out; at test time, no units are dropped out; instead, the layer's output values are scaled down by a factor equal to the dropout rate, to balance for the fact that more units are active than at training time

consider a numpy matrix containing the output of a layer -- 'layer_output' of shape (batch_size, features); at training time, zero out at random a fraction of the values in the matrix:

In [None]:
import numpy as np
# at training time: drop out 50% of the units in the output
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
# at test time: no dropout; scale the output down by the dropout rate instead
layer_output *= 0.5

In [None]:
# alternative: do both operations at training time (scaling up rather than
# scaling down) and leave the output unchanged at test time
layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output /= 0.5

![Dropout](./50%_dropout.png)

introducing noise in the output values of a layer can break up happenstance patterns that aren't significant, which the network will start memorizing if no noise is present
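
the training-time variant can be wrapped in a small function for any rate; this is a numpy sketch with a hypothetical name `inverted_dropout`, and it scales the kept values by 1 / (1 - rate), which equals dividing by the rate only when the rate is 0.5:

```python
import numpy as np

def inverted_dropout(layer_output, rate, rng):
    """Zero out roughly `rate` of the values and scale the rest by 1 / (1 - rate)."""
    keep_mask = rng.random(layer_output.shape) >= rate  # True where a value is kept
    return layer_output * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
layer_output = np.ones((1000, 100))  # hypothetical layer output
dropped = inverted_dropout(layer_output, rate=0.5, rng=rng)
```

about half of the values in `dropped` are 0 and the rest are 2, so the average activation stays close to the original value of 1 and nothing needs to change at test time.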

dropout layer in Keras:

In [None]:
model.add(layers.Dropout(0.5))

In [None]:
# adding dropout to the IMDB network

model = models.Sequential()
model.add(layers.Dense(16, activation='relu', input_shape=(10000,)))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(16, activation='relu'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(1, activation='sigmoid'))

![Effect of Dropout on Validation Loss](./ef_dropout_val_loss.png)

- recap: the most common ways to prevent overfitting in neural networks
    - get more training data
    - reduce the capacity of the network
    - add weight regularization
    - add dropout