### `About Deep Learning`
- Source link: `https://towardsdatascience.com/deep-learning-with-python-neural-networks-complete-tutorial-6b53c0b06af0`
- TensorFlow (by Google) - production ready
- PyTorch (by Facebook) - good for rapid prototyping, easy to learn
- Both leverage the power of NVIDIA GPUs, which is useful for processing big datasets (text corpora, image galleries)
- Keras is a higher-level module, far more user-friendly than pure TensorFlow
- With GPU support enabled, the Python code is translated into CUDA by the framework and processed on the GPUs, so the model can run much faster
- Artificial Neural Networks
  - ANNs are made of layers, each with an input and an output dimension
  - Neurons are also known as "nodes". A node is a computational unit that passes the weighted sum of its inputs through an `activation function - helps neuron to switch on/off`
  - `Weights` are randomly initialised and optimised during training to minimize a loss function
  - Input Data - matrix of 3 features (shape N*3)
  - `Input Layer` - takes 3 numbers as input and passes the same 3 numbers to the next layer
  - `Hidden Layer` - represents intermediary nodes; they apply several transformations to improve the accuracy of the final result, and the size of the layer's output is defined by its number of neurons
  - `Output Layer` - returns the final output of the Neural Network. For `Binary classification - o/p layer contains 1 neuron to return only 1 number`, for `Multiclass Classification(5 classes) - o/p layer contains 5 neurons`
  - The simplest form of ANN is the `Perceptron`, a model with 1 layer only - very similar to `Linear Regression`
  - Asking what happens inside a Perceptron is equivalent to asking what happens inside a single node of a multi-layer Neural Network
  - Data should always be scaled before being fed into a Neural Network
  - Just like in every other ML use case, we are going to train a model to predict the target using the features row by row. Let's start with the first row.
  - 'Training a model' means searching for the parameters of a mathematical formula that minimize the error of your predictions
    - In a `Linear Regression Model` you have to find the best weights
    - In a `Tree-Based Model(Random Forest)` you have to find the best splitting points
  - Weights are randomly initialised, then adjusted as the learning proceeds
  - In a `linear model - Σ(xi * wi) + bias` - we can use Linear Regression
  - In a `non-linear model - f(Σ(xi * wi) + bias)` - we can use Deep Learning, where the `activation function` f transforms the output of the regression formula and helps in learning the weights
  - The activation function defines the output of a node. We can choose from many built-in activation functions or write custom ones - see `Keras - Activation Function / Wiki - Activation Function`
  - Ex: Binary step - the activation function returns only 1 or 0
  - Perceptron - a single-layer network takes 1 row as input (3 column values), applies the regression formula, and passes the result through the activation function, which defines the output. That output is compared with the target to calculate the error, the weights are optimised, and the whole process is repeated again and again.
- Deep Neural Networks   
  - A Neural Network can be called a `Deep Neural Network` when it has at least 2 hidden layers.
  - Imagine replicating the neuron process 3 times simultaneously (i.e., 3 neurons in the 1st layer). Since each node (weighted sum & activation function) returns a value, we would have a `1st hidden layer with 3 outputs`
  - Now use those 3 outputs as inputs for the 2nd hidden layer, which returns 3 new numbers. Finally, we add an output layer (1 node only) to get the final prediction of our model.
  - Remember that layers can have different numbers of neurons and different activation functions, and in each node the weights are trained to optimize the final result.
  - That's why the more layers you add, the bigger the number of trainable parameters gets.
  - `Bias`: inside each neuron, the linear combination of inputs and weights also includes a bias, similar to the constant in a linear equation. Therefore the full formula of a neuron is:
    - `f(Σ(xi * wi) + bias)`
  - `Backpropagation:` During training, the model learns by propagating the error back into the nodes and updating the parameters (weights and biases) to minimize the loss.
  - `Gradient Descent:` Optimization algorithm used to train a Neural Network; it finds a local minimum of the loss function by taking repeated steps in the direction of steepest descent
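
The neuron formula (weighted sum plus bias, passed through an activation function) and the gradient descent loop described in the notes above can be sketched in plain NumPy. This is a minimal illustration, not how Keras implements it: the input row, the weights, the toy data `y = 2x + 1`, the learning rate and the step count are all assumptions chosen for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def binary_step(z):
    return np.where(z >= 0, 1.0, 0.0)

def neuron(x, w, b, activation):
    """Weighted sum of the inputs plus bias, passed through an activation."""
    return activation(np.dot(x, w) + b)

# One row with 3 features (values are made up for illustration)
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -1.0, 0.25])
b = 0.5
# Weighted sum: 0.5 - 2.0 + 0.75 + 0.5 = -0.25
relu_out = neuron(x, w, b, relu)         # negative sum, so ReLU returns 0
step_out = neuron(x, w, b, binary_step)  # negative sum, so the step returns 0

# Gradient descent on a single linear neuron, toy data y = 2x + 1
xs = np.linspace(0.0, 1.0, 5)
ys = 2.0 * xs + 1.0
weight, bias = 0.0, 0.0
learning_rate = 0.1
for _ in range(2000):
    error = weight * xs + bias - ys
    grad_w = 2.0 * np.mean(error * xs)  # d(MSE)/d(weight)
    grad_b = 2.0 * np.mean(error)       # d(MSE)/d(bias)
    weight -= learning_rate * grad_w
    bias -= learning_rate * grad_b

print(relu_out, step_out, round(weight, 3), round(bias, 3))
```

Each loop iteration is one step in the direction of steepest descent; on this toy data the weight and bias approach 2 and 1.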



### `Import Libraries`

In [2]:
import pandas as pd 
import numpy as np
# import tensorflow as tf
# import tensorflow.compat.v1 as tf
# tf.disable_v2_behavior()
import shap
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from math import sqrt
from io import StringIO
from IPython.display import display, HTML
from tensorflow.keras import models, layers, utils, backend as bk

warnings.filterwarnings("ignore")
%matplotlib inline

In [3]:
# !pip install tensorflow
# !pip install shap

### `Sample Data`

In [4]:
companya_sales_data = """
Product_Sell,Revenue_Generation
10,1000
15,1400
18,1800
22,2400
26,2600
30,2800
5,700
31,2900
"""

df = pd.read_table(StringIO(companya_sales_data), sep=",")
display(HTML(df.to_html()))

| | Product_Sell | Revenue_Generation |
|---|---|---|
| 0 | 10 | 1000 |
| 1 | 15 | 1400 |
| 2 | 18 | 1800 |
| 3 | 22 | 2400 |
| 4 | 26 | 2600 |
| 5 | 30 | 2800 |
| 6 | 5 | 700 |
| 7 | 31 | 2900 |


### `Train & Test Split`

In [5]:
X = df["Product_Sell"].values
y = df["Revenue_Generation"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print("X_train.shape : ", X_train.shape); print("X_test.shape : ", X_test.shape)

print("X_train --> ", X_train)
print("X_test --> ", X_test)
print("y_train --> ", y_train)
print("y_test --> ", y_test)

X_train.shape :  (5,)
X_test.shape :  (3,)
X_train -->  [26 10 30 22  5]
X_test -->  [31 15 18]
y_train -->  [2600 1000 2800 2400  700]
y_test -->  [2900 1400 1800]


### `Artificial Neural Network : Perceptron / 1 Dense Layer`

- `Model - Fit`

In [6]:
model = models.Sequential(name="Perceptron", layers=[
    ##### Fully Connected Layer
    layers.Dense(
        name = "dense", 
        input_dim = 1, # 1 feature (Product_Sell) as the input
        activation = "relu", # rectified linear unit, f(x) = max(0, x); "linear" would give f(x) = x
        units = 1, # 1 node because we want 1 output
    )
])

model.summary()

Model: "Perceptron"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 1)                 2         
                                                                 
Total params: 2
Trainable params: 2
Non-trainable params: 0
_________________________________________________________________


#### `Observation`
- 2 params (1 Weight & 1 Bias)

- `Activation Function`
  - The activation function used in this layer is a `rectified linear unit, or ReLU`, f(x) = max(0, x).
  - It is one of the most widely used activation functions because it is nonlinear and does not activate all the neurons at the same time.
  - In simple terms, only the neurons with a positive weighted sum are active for a given input, which makes the network sparse and efficient.
- `optimizer`
  - `model.compile` takes the loss measure and the optimizer used for training: here the `Mean Squared Error : Loss Measure(Loss Function)` and the `Adam Optimizer : minimization algorithm(minimises loss value)`.
  - An advantage of the "adam" optimizer is that it adapts the step size for each parameter, so its default learning rate (0.001 in Keras) often works without manual tuning, unlike plain gradient descent. The learning rate can still be set explicitly when the default converges too slowly.
- `epochs`
  - Represents the number of complete passes over the training data
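
To make the point about only some neurons being activated concrete, here is a small NumPy sketch of ReLU applied to a handful of pre-activation values. The numbers are made up for illustration; they are not outputs of the model above.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keeps positive values, zeroes out the rest."""
    return np.maximum(0.0, z)

# Hypothetical weighted sums (pre-activations) of 6 neurons for one input row
pre_activations = np.array([-1.5, 0.3, -0.2, 2.0, -0.7, 0.0])
outputs = relu(pre_activations)

# A neuron counts as "active" when its output is greater than zero
active = int(np.count_nonzero(outputs))
print(outputs)
print(f"{active} of {outputs.size} neurons are active")
```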

- `Without Data Scaling`
- `epochs=20`

In [7]:
model.compile(loss= "mean_squared_error" , optimizer="adam", metrics=["mean_squared_error"])
model.fit(X_train, y_train, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7ff7e9bf5710>

- `Model Evaluation`

In [9]:
print("Predictions on Train Data : ")
print("--"*20)
pred_train = model.predict(X_train)
print("y_train --> ", y_train)
print("pred_train --> ", pred_train.tolist())
print("RMSE --> ", np.sqrt(mean_squared_error(y_train,pred_train)))  # square root of MSE, i.e. RMSE
print("--"*20)
print("Predictions on Test Data : ")
print("--"*20)
pred_test = model.predict(X_test)
print("y_test --> ", y_test)
print("pred --> ", pred_test.tolist())
print("RMSE --> ", np.sqrt(mean_squared_error(y_test,pred_test)))

Predictions on Train Data : 
----------------------------------------
y_train -->  [2600 1000 2800 2400  700]
pred_train -->  [[14.125604629516602], [5.445232391357422], [16.295698165893555], [11.955511093139648], [2.7326161861419678]]
RMSE -->  2079.126427131573
----------------------------------------
Predictions on Test Data : 
----------------------------------------
y_test -->  [2900 1400 1800]
pred -->  [[16.83822250366211], [8.157848358154297], [9.785418510437012]]
RMSE -->  2117.7594022992553


#### `Observation`
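- Without data scaling, the ANN does not make correct predictions. Unlike classic Linear Regression, which solves directly for the relation between the indep. & dep. variables, the network starts from small random weights and moves them in small steps, so 20 epochs are far too few to reach targets in the thousands.
- So the variables (indep. & dep.) should be brought onto the same scale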
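
One way to see why 20 epochs on the unscaled data barely move the model: with 5 training rows and the Keras default batch size of 32, each epoch is a single Adam update, and one Adam update changes a parameter by roughly the learning rate (0.001 by default). The sketch below re-implements the Adam update rule in NumPy for one weight and one bias. It is an illustration under assumptions, not the Keras code path: the starting weight of 0.5 and bias of 0.0 are made up, and the ReLU is left out because every prediction here is positive.

```python
import numpy as np

X_train = np.array([26.0, 10.0, 30.0, 22.0, 5.0])
y_train = np.array([2600.0, 1000.0, 2800.0, 2400.0, 700.0])

# Adam hyperparameters (Keras defaults)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-7

params = np.array([0.5, 0.0])  # [weight, bias], assumed starting values
m = np.zeros(2)                # running mean of the gradients
v = np.zeros(2)                # running mean of the squared gradients

for t in range(1, 21):  # 20 epochs = 20 full-batch updates
    error = params[0] * X_train + params[1] - y_train
    grad = np.array([2.0 * np.mean(error * X_train), 2.0 * np.mean(error)])
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    params -= lr * m_hat / (np.sqrt(v_hat) + eps)

weight, bias = params
print("weight:", weight, "bias:", bias)
```

Each parameter moves by about 0.001 per update, so after 20 updates the weight has shifted by only about 0.02, while fitting this data needs a slope in the region of 90. Scaling the data brings the targets into the range the optimizer can reach in a few steps.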

### `Data Scaling`

In [10]:
##### Transform Independent Data #####
print("Transform Independent Data : ")
print("--"*20)
print("Before Scaling : X_train --> ", X_train)
print("Before Scaling : X_test --> ", X_test)
# Initialise MinMaxScaler
x_scaler = MinMaxScaler()
X_train_arr = np.array(X_train).reshape(-1, 1)
X_test_arr = np.array(X_test).reshape(-1, 1)
x_scaler.fit(X_train_arr)
X_train_scaled = x_scaler.transform(X_train_arr)
X_test_scaled = x_scaler.transform(X_test_arr)
print("After Scaling : X_train_scaled --> ", X_train_scaled)
print("After Scaling : X_test_scaled --> ", X_test_scaled)
##### Transform Dependent Data #####
print("Transform Dependent Data : ")
print("--"*20)
print("Before Scaling : y_train --> ", y_train)
# Initialise MinMaxScaler
y_scaler = MinMaxScaler()
y_train_arr = np.array(y_train).reshape(-1, 1)
y_scaler.fit(y_train_arr)
y_train_scaled = y_scaler.transform(y_train_arr)
print("After Scaling : y_train_scaled --> ", y_train_scaled)

Transform Independent Data : 
----------------------------------------
Before Scaling : X_train -->  [26 10 30 22  5]
Before Scaling : X_test -->  [31 15 18]
After Scaling : X_train_scaled -->  [[0.84]
 [0.2 ]
 [1.  ]
 [0.68]
 [0.  ]]
After Scaling : X_test_scaled -->  [[1.04]
 [0.4 ]
 [0.52]]
Transform Dependent Data : 
----------------------------------------
Before Scaling : y_train -->  [2600 1000 2800 2400  700]
After Scaling : y_train_scaled -->  [[0.9047619 ]
 [0.14285714]
 [1.        ]
 [0.80952381]
 [0.        ]]
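
With default settings, `MinMaxScaler` maps each value to `(x - min) / (max - min)`, using the min and max seen during `fit`. The NumPy sketch below reproduces the scaled feature values printed above and shows the inverse mapping that is used later to bring predictions back to the original units. The helper function names are hypothetical.

```python
import numpy as np

X_train = np.array([26.0, 10.0, 30.0, 22.0, 5.0])
X_test = np.array([31.0, 15.0, 18.0])

def min_max_fit(values):
    """Return the (min, max) learned from the training values."""
    return values.min(), values.max()

def min_max_transform(values, lo, hi):
    return (values - lo) / (hi - lo)

def min_max_inverse(scaled, lo, hi):
    return scaled * (hi - lo) + lo

lo, hi = min_max_fit(X_train)  # 5.0 and 30.0
X_train_scaled = min_max_transform(X_train, lo, hi)
X_test_scaled = min_max_transform(X_test, lo, hi)
X_test_restored = min_max_inverse(X_test_scaled, lo, hi)

print(X_train_scaled)   # training values fall inside [0, 1]
print(X_test_scaled)    # 31 is above the training max, so it maps above 1
print(X_test_restored)  # back to the original units
```

The scaler is fitted on the training data only, which is why a test value outside the training range (31) ends up slightly above 1.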


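- `With Data Scaling`
- `epochs=40`

In [17]:
model.compile(loss= "mean_squared_error" , optimizer="adam", metrics=["mean_squared_error"])
# Train on the scaled arrays, since the evaluation below predicts on X_train_scaled / X_test_scaled
model.fit(X_train_scaled, y_train_scaled, epochs=40)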

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7ff7ead94750>

- `Model Evaluation`

In [18]:
print("Train Predictions : ")
print("--"*20)
pred = model.predict(X_train_scaled)
#invert normalize
y_train = y_scaler.inverse_transform(y_train_scaled)
train_predictions = y_scaler.inverse_transform(pred) 
print("y_train --> ", y_train)
print("train_predictions --> ", train_predictions)
print("RMSE --> ", np.sqrt(mean_squared_error(y_train,train_predictions)))  # square root of MSE, i.e. RMSE

print("Test Predictions : ")
print("--"*20)
pred = model.predict(X_test_scaled)
#invert normalize
test_predictions = y_scaler.inverse_transform(pred) 
print("y_test --> ", y_test)
print("test_predictions --> ", test_predictions)
print("RMSE --> ", np.sqrt(mean_squared_error(y_test,test_predictions)))

Train Predictions : 
----------------------------------------
y_train -->  [[2600.]
 [1000.]
 [2800.]
 [2400.]
 [ 700.]]
train_predictions -->  [[2161.0076 ]
 [1271.066  ]
 [2383.4927 ]
 [1938.522  ]
 [ 992.95935]]
RMSE -->  384.30629842480687
Test Predictions : 
----------------------------------------
y_test -->  [2900 1400 1800]
test_predictions -->  [[2439.1143]
 [1549.1727]
 [1716.0366]]
RMSE -->  283.8532599620366


#### `Observation`
- After data scaling and more training iterations (epochs), the single-perceptron ANN makes much closer predictions: the reported error drops from about 2079 to about 384 on train data and from about 2118 to about 284 on test data.
- As the data (indep. & dep.) is roughly linearly related, a single perceptron and few epochs were enough to get reasonable predictions.
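
Since this observation relies on the data being roughly linear, a closed-form least-squares line is a useful baseline to compare the perceptron against. The sketch below uses `np.polyfit` on the same train/test split. It is an added comparison, not part of the original notebook run, and the `rmse` helper is hypothetical.

```python
import numpy as np

X_train = np.array([26.0, 10.0, 30.0, 22.0, 5.0])
y_train = np.array([2600.0, 1000.0, 2800.0, 2400.0, 700.0])
X_test = np.array([31.0, 15.0, 18.0])
y_test = np.array([2900.0, 1400.0, 1800.0])

# A degree-1 polynomial fit is the ordinary least-squares line
slope, intercept = np.polyfit(X_train, y_train, deg=1)

def rmse(actual, predicted):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

baseline_train_rmse = rmse(y_train, slope * X_train + intercept)
baseline_test_rmse = rmse(y_test, slope * X_test + intercept)

print("slope:", slope, "intercept:", intercept)
print("train RMSE:", baseline_train_rmse)
print("test RMSE:", baseline_test_rmse)
```

If the perceptron's error stays well above this baseline, that suggests it needs more training rather than a more complex architecture.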


### `Deep Learning Neural Network - 2 Layers`

- `Model - Fit`

In [19]:
# Define model
model = models.Sequential()
model.add(layers.Dense(5, input_dim=1, activation= "relu"))
model.add(layers.Dense(1))
model.summary() #Print model Summary

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 5)                 10        
                                                                 
 dense_1 (Dense)             (None, 1)                 6         
                                                                 
Total params: 16
Trainable params: 16
Non-trainable params: 0
_________________________________________________________________


#### `Observation`
- 1st layer has 1 input(indep. variable value)
- 5 Nodes in 1st layer
- 1 Node in 2nd layer
- Each node in the 1st layer has 2 parameters in this case, i.e., 1 indep. var coeff + 1 bias
- So the 1st layer has 5 * 2 = 10 trainable parameters
- The 2nd layer has 5 inputs (i.e., the 1st layer's output). Hence, 5 coeffs + 1 bias = 6 trainable parameters
- In total, 10 + 6 = `16 trainable parameters` in this deep learning neural network
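
The counting rule used above (one weight per input per node, plus one bias per node) can be written as a small helper. The function name `dense_params` is hypothetical; it is a sketch of the arithmetic, not a Keras API.

```python
def dense_params(n_inputs, units, use_bias=True):
    """Trainable parameters of a fully connected layer."""
    weights = n_inputs * units
    biases = units if use_bias else 0
    return weights + biases

hidden = dense_params(n_inputs=1, units=5)  # 1*5 weights + 5 biases
output = dense_params(n_inputs=5, units=1)  # 5*1 weights + 1 bias
total = hidden + output
print(hidden, output, total)
```

The same helper gives 2 for the single perceptron (`n_inputs=1, units=1`), matching its model summary.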

- `With Data Scaling`
- `epochs=20`

In [20]:
model.compile(loss= "mean_squared_error" , optimizer="adam", metrics=["mean_squared_error"])
model.fit(X_train_scaled, y_train_scaled, epochs=20)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7ff7eb117d10>

- `Model Evaluation`

In [21]:
print("Train Predictions : ")
print("--"*20)
pred = model.predict(X_train_scaled)
#invert normalize
y_train = y_scaler.inverse_transform(y_train_scaled)
train_predictions = y_scaler.inverse_transform(pred) 
print("y_train --> ", y_train)
print("train_predictions --> ", train_predictions)
print("RMSE --> ", np.sqrt(mean_squared_error(y_train,train_predictions)))  # square root of MSE, i.e. RMSE

print("Test Predictions : ")
print("--"*20)
pred = model.predict(X_test_scaled)
#invert normalize
test_predictions = y_scaler.inverse_transform(pred) 
print("y_test --> ", y_test)
print("test_predictions --> ", test_predictions)
print("RMSE --> ", np.sqrt(mean_squared_error(y_test,test_predictions)))

Train Predictions : 
----------------------------------------
y_train -->  [[2600.]
 [1000.]
 [2800.]
 [2400.]
 [ 700.]]
train_predictions -->  [[361.43274]
 [736.11523]
 [267.7621 ]
 [455.1032 ]
 [781.4446 ]]
RMSE -->  1748.273671390504
Test Predictions : 
----------------------------------------
y_test -->  [2900 1400 1800]
test_predictions -->  [[244.34433]
 [619.0269 ]
 [548.774  ]]
RMSE -->  1753.852191766989


#### `Observation`
- With the single perceptron (single layer, single node, single activation function), predictions were good after increasing the epochs.
- In the deeper network the number of trainable parameters increases to 16, so training becomes harder even on this small, almost linearly related dataset: after 20 epochs the reported error is still about 1750 on both train and test data.
- This DL network would likely need a higher number of epochs to move all 16 parameters in the right direction.
- A DL network may be better suited to multiple linear regression problems, while a single perceptron may be enough for Simple Linear Regression.
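
To illustrate the point about needing more training steps, the sketch below trains a 1 -> 5 (ReLU) -> 1 network from scratch in NumPy on the scaled training data and compares the loss after 20 steps with the loss after 5000 steps. It is a stand-in under stated assumptions, not a rerun of the Keras model: it uses plain full-batch gradient descent instead of Adam, and the initial weights, seed, learning rate and step counts are all made up for the example.

```python
import numpy as np

# Scaled training data as printed in the Data Scaling section
X = np.array([[0.84], [0.2], [1.0], [0.68], [0.0]])
y = np.array([[0.9047619], [0.14285714], [1.0], [0.80952381], [0.0]])

def train(steps, learning_rate=0.02, seed=0):
    """Train a 1 -> 5 (ReLU) -> 1 network with full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, size=(1, 5))
    b1 = np.zeros(5)
    W2 = rng.normal(0.0, 0.5, size=(5, 1))
    b2 = np.zeros(1)
    n = X.shape[0]
    for _ in range(steps):
        # Forward pass
        z1 = X @ W1 + b1
        h = np.maximum(0.0, z1)
        pred = h @ W2 + b2
        # Backward pass for the mean squared error loss
        d_pred = 2.0 * (pred - y) / n
        grad_W2 = h.T @ d_pred
        grad_b2 = d_pred.sum(axis=0)
        d_z1 = (d_pred @ W2.T) * (z1 > 0)
        grad_W1 = X.T @ d_z1
        grad_b1 = d_z1.sum(axis=0)
        # Gradient descent update of all 16 parameters
        W1 -= learning_rate * grad_W1
        b1 -= learning_rate * grad_b1
        W2 -= learning_rate * grad_W2
        b2 -= learning_rate * grad_b2
    pred = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
    return float(np.mean((pred - y) ** 2))

loss_20 = train(steps=20)
loss_5000 = train(steps=5000)
print("MSE after 20 steps:  ", loss_20)
print("MSE after 5000 steps:", loss_5000)
```

Both runs start from the same seed, so the only difference between them is the number of update steps.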

### `Reference Links`
- https://www.h2kinfosys.com/blog/linear-regression-with-keras-on-tensorflow/
- https://www.pluralsight.com/guides/regression-keras
- https://datascienceplus.com/keras-regression-based-neural-networks/
