In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from tqdm import tqdm

pd.options.display.max_colwidth = 200

import warnings
warnings.filterwarnings('ignore')

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames[:5]:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In this notebook we introduce the basics of neural networks.

# 1. Basics of ANN

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. Neural networks are made of perceptrons.

A perceptron is the basic computational unit of the network: it computes a weighted sum of its inputs, adds a bias, and passes the result through an activation function.

<img src="https://miro.medium.com/max/1400/1*ZoT34FHnrrPxWHsXv6mK8Q.jpeg" width="60%">

In an artificial neural network (ANN), we use multiple such perceptrons to learn different attributes of the dataset. Together, they can learn highly complex non-linear structure from the data.

In the above picture - 

(1) are the input features

(2) are the weights of the hidden layer perceptrons

(3) are the representations learned by each perceptron

(4) is the activation function

(5) is the output
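The numbered parts above can be sketched for a single perceptron in NumPy. This is only an illustration: the input values, weights, bias, and the choice of a sigmoid activation are made-up assumptions, and the function name `perceptron_forward` is hypothetical.

```python
import numpy as np

def perceptron_forward(x, w, b):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = np.dot(w, x) + b               # (2)/(3): weights applied to the inputs
    return 1.0 / (1.0 + np.exp(-z))    # (4): activation; the result is (5)

x = np.array([0.5, -1.0, 2.0])   # (1) input features (made-up values)
w = np.array([0.4, 0.3, -0.2])   # hypothetical weights
b = 0.1                          # hypothetical bias

output = perceptron_forward(x, w, b)
print(output)
```

During training, `w` and `b` are the quantities the network adjusts; `x` comes from the data.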


## 1a. Neural Network Architecture

Different types of cells are meant for different types of computations. e.g. - 

* Dense cell - these are meant for dense computations (basic perceptron)
* RNN cell - recurrent neural network cell is used where we have to store some temporal information in the NN
* Convolution cell - convolution cells are just like image filters, which are used to learn new features. Conv cells are mostly popular for image type of data.

Each cell takes an input of a fixed dimension and generates an intermediate hidden output of a fixed dimension. While building the neural network, we need to make sure that the output dimension of each layer matches the input dimension of the next, so that the final layer's output matches the dimensions of the target data.

In deep neural networks, multiple layers of such cells are stacked together. For example, in the above diagram, we have a simple 1-layer ANN with just dense cells.

During the learning process, neural networks learn the weights of each of these cells. These are called the parameters of the neural network.
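To make "parameters" concrete, here is a minimal sketch that counts the weights and biases of a stack of dense layers. It assumes fully connected layers with one bias per output unit; the helper name `count_dense_parameters` and the layer sizes are illustrative only.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the input dimension followed by the width of each layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per (input, output) pair
        total += n_out          # one bias per output unit
    return total

sizes = [4, 3, 2]   # 4 inputs -> 3 hidden units -> 2 outputs (made-up sizes)
n_params = count_dense_parameters(sizes)
print(n_params)     # 4*3 + 3 + 3*2 + 2 = 23
```

The same arithmetic is what `model.summary()` in Keras reports for `Dense` layers.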

## 1b. Activation function

Activation functions are mathematical functions applied to the outputs of the hidden layers and of the final layer. Non-linear activation functions are what allow the network to learn complex, non-linear structure from data. A few activation functions are - 

* Linear
* Sigmoid (logistic)
* ReLU (rectified linear unit)
* Softmax
* Tanh
* Swish
* Mish
* ELU

<img src="https://miro.medium.com/max/2000/1*4ZEDRpFuCIpUjNgjDdT2Lg.png" width="50%">

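ReLU is a widely used activation function for intermediate layers. Sigmoid is widely used in the final layer for binary classification tasks. For multiclass classification, softmax is used in the final layer. In regression tasks, a linear (identity) activation is typically used in the final layer, since the target can take any real value; ReLU is suitable there only when the target is known to be non-negative.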
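Three of the activations listed above can be written in a few lines of NumPy. This is a minimal sketch for illustration, not how Keras implements them; the input vector `z` is made up.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1); used for binary outputs."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Keeps positive values and zeroes out negative ones."""
    return np.maximum(0.0, z)

def softmax(z):
    """Turns a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

z = np.array([-2.0, 0.0, 3.0])   # made-up pre-activation values
print(relu(z))
print(sigmoid(z))
print(softmax(z))
```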

## 1c. How does a neural network learn

In neural networks, we also need to define a loss function that measures the difference between the actual target and the output predicted by the neural network. Once the loss is calculated, the neural network uses <b>backpropagation</b> to learn the weights of its cells.

Forward propagation is the process of moving forward through the neural network (from the inputs to the final output or prediction). Backpropagation is the reverse, except that instead of the signal, we move the error backwards through the model.
 
<img src=https://miro.medium.com/max/1400/1*UY4-RIrSVgfuhAkawKIr2w.jpeg>

<img src=https://miro.medium.com/max/1400/1*0RIBu3Iz-aOOX9dyob_FHA.jpeg>

Backpropagation can be thought of as a kind of feedback loop for the network.
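As a small worked example, the sketch below runs a forward pass and a backward pass by hand for one sigmoid neuron with a squared-error loss, then checks the result against a numerical gradient. The neuron, the loss, and all numeric values are assumptions chosen for illustration.

```python
import math

def forward(w, b, x):
    """Forward pass: weighted input plus bias, then a sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def loss(w, b, x, y):
    """Squared-error loss for a single example."""
    return 0.5 * (forward(w, b, x) - y) ** 2

def backward(w, b, x, y):
    """Backward pass: apply the chain rule from the loss back to w and b."""
    a = forward(w, b, x)
    dloss_da = a - y              # derivative of the loss w.r.t. the activation
    da_dz = a * (1.0 - a)         # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz   # error signal moved back through the activation
    return dloss_dz * x, dloss_dz # gradients w.r.t. w and b

w, b, x, y = 0.5, -0.1, 2.0, 1.0  # made-up weight, bias, input, target
grad_w, grad_b = backward(w, b, x, y)

# Numerical check: a central finite difference should agree with the chain rule
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(grad_w, numeric_grad_w)
```

In a multi-layer network the same chain rule is applied layer by layer, which is what frameworks such as Keras automate.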

## 1d. Cost function

A cost function is used to calculate the loss between the actual output and the output predicted by the network.

<b> P.S. - although a cost function and an evaluation metric are very similar, there is a fine difference between the two. The cost function is used for backpropagation, so it needs to be differentiable (at least almost everywhere). An evaluation metric, on the other hand, is only used to evaluate a model's performance. </b> 

For regression, the most popular loss function is Mean Squared Error (MSE), or L2 loss.

<img src="https://miro.medium.com/max/1026/1*SGhoeJ_BgcfqU06CmX41rw.png" width="20%">

Similarly, L1 loss, or Mean Absolute Error (MAE), is also used for regression tasks.

<img src="https://miro.medium.com/max/1066/1*piCo0iDgPmESnQkHSwAK6A.png" width="20%">

For classification, cross entropy loss is used.

<img src="https://miro.medium.com/max/1400/1*zi1wKAAGGt1Bn6mqo2MSFw.png" width="40%">
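The three losses above can be sketched in NumPy as follows. The function names and the toy arrays are illustrative assumptions; the cross-entropy version assumes one-hot targets and predicted probabilities per row.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error (L2 loss)."""
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    """Mean Absolute Error (L1 loss)."""
    return np.mean(np.abs(y_true - y_pred))

def categorical_cross_entropy(y_onehot, y_prob, eps=1e-12):
    """Average cross-entropy between one-hot targets and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_onehot * np.log(y_prob), axis=1))

# Regression example with made-up values
y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5, 0.0, 2.0])
reg_mse = mse(y_true, y_pred)
reg_mae = mae(y_true, y_pred)

# Classification example with two classes and made-up probabilities
y_onehot = np.array([[1.0, 0.0], [0.0, 1.0]])
y_prob = np.array([[0.9, 0.1], [0.2, 0.8]])
clf_ce = categorical_cross_entropy(y_onehot, y_prob)

print(reg_mse, reg_mae, clf_ce)
```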

Other loss functions are -

* KL divergence
* Hinge loss
* Triplet loss

## 1e. Optimization

Now that we have the loss calculated, how do NNs learn the model parameters?

Since we need to learn the model parameters that minimize the overall loss, we treat the learning process as an optimization problem: the parameters are updated step by step in order to reach a local (ideally global) minimum of the loss.

One commonly used optimization method that adjusts the weights according to the error they caused is <b> gradient descent </b>.

Gradient is another word for slope, and slope, in its typical form on an x-y graph, represents how two variables relate to each other: rise over run, the change in money over the change in time, etc. In this particular case, the slope we care about describes the relationship between the network's error and a single weight; that is, how the error varies as the weight is adjusted.


These gradients tell us how to update each parameter in order to move towards a local (ideally global) minimum. In the NN, the parameters are initialized with random values and then updated using the gradients calculated during backpropagation.

<img src=https://miro.medium.com/max/1201/1*VymEfQTf30evUczsTBiz4g.png width=400>
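The update rule can be illustrated on a one-parameter toy problem. The sketch below minimizes the made-up loss f(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate, the number of steps, and the function name `gradient_descent` are assumptions for illustration.

```python
def gradient_descent(grad_fn, w_init, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient to reduce the loss."""
    w = w_init
    for _ in range(n_steps):
        w = w - learning_rate * grad_fn(w)   # move downhill along the slope
    return w

# Toy loss f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3
w_star = gradient_descent(lambda w: 2.0 * (w - 3.0), w_init=0.0)
print(w_star)
```

With a learning rate that is too large the steps overshoot and can diverge; with one that is too small convergence is slow.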

Different optimization techniques are available in the literature. To name a few -

* SGD (stochastic gradient descent)
* Adam
* RMSProp
* Adagrad
* RAdam
* Adadelta


## 1f. Regularization

Regularization is a very useful technique for improving how well a neural network generalizes to unseen data. To reduce overfitting, we need to restrict the weights of the neurons. L1 and L2 regularization do this by adding a penalty on the weights to the loss, which restricts the search space of the parameters.

Another idea is <b>dropout</b>. With dropout, the neural network drops (zeroes out) randomly selected neurons at each step of the training process. 

<img src="https://miro.medium.com/max/1400/1*iWQzxhVlvadk6VAJjsgXgg.png" width="50%">
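Both ideas can be sketched in NumPy. This is an illustration under assumptions: the penalty strength `lam`, the drop rate, and the helper names are made up, and the dropout shown is the common "inverted" variant that rescales the surviving activations during training.

```python
import numpy as np

def l2_penalty(weights, lam):
    """L2 regularization term that is added to the loss."""
    return lam * np.sum(weights ** 2)

def dropout(activations, drop_rate, rng):
    """Inverted dropout: zero out random units and rescale the survivors."""
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob   # True for units that are kept
    return activations * mask / keep_prob

rng = np.random.default_rng(0)

weights = np.array([1.0, -2.0, 0.5])       # made-up weights
penalty = l2_penalty(weights, lam=0.01)    # 0.01 * (1 + 4 + 0.25)

activations = np.ones(1000)                # made-up layer output
dropped = dropout(activations, drop_rate=0.5, rng=rng)
print(penalty, dropped[:10], dropped.mean())
```

Dropout is applied only during training; at prediction time all neurons are kept.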

Now that we have learned the basics of neural networks, let us build a simple ANN (not from scratch though) using Keras, with scikit-learn for feature extraction. Our objective is to classify the processed clinical notes into different specialities.

In [None]:
df = pd.read_csv('/kaggle/input/nlp-specialization-data/Cleaned_POS_Medical_Notes.csv') #for excel file use read_excel
df

In [None]:
df.head(5)

In [None]:
df['label'].nunique()

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vector = TfidfVectorizer(lowercase=True, #this will convert all the tokens into lower case
                         stop_words='english', #remove english stopwords from vocabulary. if we need the stopwords this value should be None
                         analyzer='word', #tokens should be words. we can also use char for character tokens
                         max_features=5000, #maximum vocabulary size to restrict too many features
                         min_df = 5)

tfidf_vectorized_corpus = tfidf_vector.fit_transform(df.clean_text)

In [None]:
print (tfidf_vectorized_corpus.shape)

In [None]:
tfidf_vectorized_corpus.toarray()

In [None]:
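#y = keras.utils.to_categorical(df['label'].values, 5)
# One-hot encode the labels; cast to float because newer pandas versions return boolean dummies
y = pd.get_dummies(df['label']).values.astype('float32')
y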

In [None]:
# Using Tensorflow Keras instead of the original Keras

from tensorflow import keras
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense

from sklearn import metrics

In [None]:
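# Derive the layer sizes from the data instead of hardcoding them
n_features = tfidf_vectorized_corpus.shape[1]  # TF-IDF vocabulary size (3843 in the original run)
n_classes = y.shape[1]                         # number of specialities (5 in the original run)

model = Sequential()
model.add(Dense(64, input_shape=(n_features,), activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(4, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))  # one probability per speciality
model.summary()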

In [None]:
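# 'rmsprop' is the Keras default optimizer; it is spelled out here to make the choice explicit
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])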

In [None]:
model.fit(tfidf_vectorized_corpus.toarray(), y, epochs=20, batch_size=128, verbose=1)

In [None]:
y_predict = model.predict(tfidf_vectorized_corpus.toarray())

In [None]:
print(y_predict[0])
print(np.argmax(y_predict[0]))

In [None]:
# Predicted class index for each note: position of the largest softmax probability
y_pred = np.argmax(y_predict, axis=1)

y_pred[0:5]

In [None]:
pd.crosstab(df['label'],pd.Series(y_pred))

Our basic ANN achieved an average F1 score of 74% in cross-validation, which is a little worse than logistic regression and naive Bayes. However, the ANN performs better than random forest. Tree-based methods are widely successful on tabular datasets, but they often struggle to capture semantic information from text data.
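The cells above do not show how the F1 score was computed. One way to compute a macro-averaged F1 from class indices is scikit-learn's `f1_score`; the sketch below uses made-up labels and predictions rather than the notebook's data, so its printed value does not correspond to the 74% quoted above.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up true labels and predicted class indices for three classes
y_true_toy = np.array([0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 1, 1, 1, 2, 0])

# Macro averaging computes F1 per class and then takes the unweighted mean
macro_f1 = f1_score(y_true_toy, y_pred_toy, average='macro')
print(macro_f1)
```

For an estimate of generalization, the predictions should come from rows the model was not trained on, for example a held-out split or cross-validation folds.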

### References

1. https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23

2. https://towardsdatascience.com/introducing-deep-learning-and-neural-networks-deep-learning-for-rookies-1-bd68f9cf5883

3. https://pathmind.com/wiki/neural-network#forward

4. https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044

5. https://towardsdatascience.com/understanding-neural-networks-19020b758230

6. https://medium.com/@amarbudhiraja/https-medium-com-amarbudhiraja-learning-less-to-learn-better-dropout-in-deep-machine-learning-74334da4bfc5

