# CNN for NLP

## classic models

- Sentence Classification (Kim 2014): word embeddings followed by convolutional and max-pooling layers and a fully connected output layer.

- Dynamic Convolutional Neural Network (Kalchbrenner et al. 2014): dynamic k-max pooling to handle variable-length input sentences and capture n-gram features.

- Paraphrase Identification (Yin & Schütze 2015)

- Summarization-based Video Caption (Li et al. 2015)

- Question Answering (Dong et al. 2015)

- Matching Natural Language Sentences (Hu et al. 2015)

- Learning Semantic Representations (Sheng et al. 2015)

- Sentiment Analysis of Short Texts (dos Santos & Gatti 2014)

- Relation Extraction (Nguyen & Grishman 2015)

- Entity Disambiguation (Sun et al. 2015)

- Modeling Interestingness with Deep Neural Networks (Gao 2015)

- Character-Aware Word Embedding (Kim et al. 2016)

- Text Classification (Conneau et al. 2016)
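
The k-max pooling idea mentioned for the Dynamic Convolutional Neural Network can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names `k_max_pooling` and `dynamic_k` are hypothetical, and the rule for choosing `k` follows the formula as commonly stated for that model (an assumption here, since these notes do not give it).

```python
def k_max_pooling(values, k):
    """Keep the k largest values of a sequence, preserving their original order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values (ties go to the earlier position).
    top = sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]
    # Re-sort the surviving indices so the output keeps the input order.
    return [values[i] for i in sorted(top)]


def dynamic_k(layer, total_layers, sentence_length, k_top):
    """Pooling size for a layer: shrinks with depth, grows with sentence length.

    Assumed rule: k = max(k_top, ceil((L - l) / L * s)), computed with integers.
    """
    numerator = (total_layers - layer) * sentence_length
    scaled = (numerator + total_layers - 1) // total_layers  # integer ceiling
    return max(k_top, scaled)


feature_map = [0.1, 0.9, 0.3, 0.7, 0.5]
pooled = k_max_pooling(feature_map, 3)  # keeps 0.9, 0.7, 0.5 in sequence order
```

Because the output length depends only on `k`, sentences of different lengths end up with feature maps of a predictable size.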

## adjustment of CNN for NLP

For NLP tasks, CNNs need some adjustments to handle sequential data effectively:

- Word embedding matrix for input representation: each row of the matrix is the embedding of one token in the input sequence. The embeddings can feed static and/or non-static channels.

- 1D convolutions: process input along sequence dimension. Filters slide across the sequence capturing local patterns (n-grams).

- Padding, truncation, or dynamic pooling: handle variable-length inputs and ensure consistent output sizes.

- Stacking: stacked convolutional layers capture the hierarchical structure of language, the same idea as stacking RNN or LSTM layers.

- Dilated convolution: widens the receptive field to handle long-range dependencies.
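
The adjustments above can be illustrated with a small standard-library sketch: a sentence is an embedding matrix (one row per token), it is padded or truncated to a fixed length, and a single filter slides along the sequence dimension, optionally with dilation. All names (`pad_or_truncate`, `conv1d`, `max_over_time`) and the toy numbers are made up for illustration; a real model would use a deep learning framework and learned weights.

```python
def pad_or_truncate(rows, length, pad_row):
    """Force a list of token embeddings to exactly `length` rows."""
    rows = rows[:length]
    return rows + [pad_row] * (length - len(rows))


def conv1d(rows, kernel, dilation=1):
    """Slide one filter along the sequence dimension.

    rows:   token embeddings, one list of floats per token
    kernel: one weight row per filter position, same width as an embedding
    """
    span = (len(kernel) - 1) * dilation + 1  # tokens covered by one application
    outputs = []
    for start in range(len(rows) - span + 1):
        total = 0.0
        for j, weights in enumerate(kernel):
            token = rows[start + j * dilation]  # skip tokens when dilation > 1
            total += sum(w * x for w, x in zip(weights, token))
        outputs.append(total)
    return outputs


def max_over_time(feature_map):
    """Collapse a variable-length feature map to a single value."""
    return max(feature_map)


sentence = [[1, 0], [0, 1], [1, 1]]               # 3 tokens, 2-dim embeddings
matrix = pad_or_truncate(sentence, 5, [0, 0])     # fixed length of 5 rows
bigram_filter = [[1, 1], [1, 1]]                  # covers 2 filter positions

dense = conv1d(matrix, bigram_filter)             # adjacent tokens (bigrams)
dilated = conv1d(matrix, bigram_filter, dilation=2)  # tokens two steps apart
feature = max_over_time(dense)
```

With `dilation=2` the same two-position filter spans three tokens, which is how dilation widens the receptive field without adding weights.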

### static vs. non-static channel

Static and non-static channels refer to how word embeddings are treated during training.

In CNN models for NLP, both channels can be used simultaneously. This leverages the benefits of both approaches: the static channel provides general semantic information, while the non-static channel captures task-specific nuances.

- Static channel: 

    word embeddings are kept fixed throughout training and are not adjusted to the task at hand.

    e.g., pre-trained word vectors like Word2Vec or GloVe.

    pros: helps prevent overfitting, especially when training data is limited.

    cons: the fixed embeddings can't be fine-tuned for the specific task.


- Non-static channel: 

    word embeddings are initialized with pre-trained word vectors but are updated during the training process. 

    pros: improved performance, as the embeddings can be adapted to capture nuances relevant to the target task.
    
    cons: higher risk of overfitting, especially if the available training data is limited.
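
The difference between the two channels can be sketched as two copies of the same pre-trained table, where only one copy receives gradient updates. The vectors, the gradient, and the helper `sgd_step` below are all made-up placeholders; in a framework this usually corresponds to freezing or unfreezing the embedding layer's weights.

```python
import copy

# Hypothetical pre-trained vectors (stand-ins for Word2Vec or GloVe entries).
pretrained = {"good": [0.5, 0.1], "bad": [-0.4, 0.2]}

static_channel = copy.deepcopy(pretrained)      # frozen: never updated
non_static_channel = copy.deepcopy(pretrained)  # same start, updated in training


def sgd_step(table, grads, lr):
    """Apply one plain gradient-descent update to the given embedding table."""
    for word, grad in grads.items():
        table[word] = [w - lr * g for w, g in zip(table[word], grad)]


# Made-up gradient for one word; only the non-static channel is updated.
grads = {"good": [1.0, -1.0]}
sgd_step(non_static_channel, grads, lr=0.1)
```

After the step, `static_channel` still equals the pre-trained table, while the `"good"` vector in `non_static_channel` has moved toward the task.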

<h5 align='center'>CNN model architecture with a single static channel for an example utterance.</h5>
<img src='https://www.researchgate.net/profile/Hyunjung-Lee-3/publication/315468897/figure/fig1/AS:501882825461760@1496669598515/CNN-model-architecture-with-single-channel-for-an-example-utterance.png' />

<h5 align='center'>CNN model architecture with 2 channels (static and non-static) for an example utterance.</h5>
<img src='https://d3i71xaburhd42.cloudfront.net/15637f68230040fe0792fa505573049aa64045ee/5-Figure3-1.png' />