# Papers on NLP

## Toxicity Detection for Indic Multilingual Social Media Content:
https://arxiv.org/pdf/2201.00598.pdf
#### Models
* XLM-RoBERTa - It is a transformer-based masked language model trained on 100 languages, using more than two terabytes of filtered CommonCrawl data. [3]
* mBERT - Multilingual BERT (mBERT) is a transformer-based language model trained on raw Wikipedia text of 104 languages. This model is contextual and its training requires no supervision - no alignment between the languages is done. [5]
* RemBERT - This model is pretrained on 110 languages using a Masked Language Modeling (MLM) objective. Its main difference from mBERT is that the input and output embeddings are not tied. Instead, RemBERT uses small input embeddings and large output embeddings. [2]

#### Architecture
MLM and transliterated data had a positive impact on performance.

Two architectures were used: the first was the original architecture, where we took the transformer outputs and computed the probabilities; in the second, we added a custom attention head on top of the transformer output before calculating the probabilities.
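
The second variant can be pictured as attention pooling over the token outputs followed by a sigmoid layer. The sketch below is only an illustration in NumPy: the paper's notes do not describe the exact form of the attention head, so the single scoring vector `w`, the shapes, and the function names are all assumptions.

```python
import numpy as np

def attention_pool(hidden, w, mask):
    """Pool token states into one vector using softmax attention weights.

    hidden: (seq_len, dim) transformer outputs for one example
    w:      (dim,) scoring vector (hypothetical learned parameter)
    mask:   (seq_len,) 1 for real tokens, 0 for padding
    """
    scores = hidden @ w                          # one score per token
    scores = np.where(mask == 1, scores, -1e9)   # padding gets ~zero weight
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over tokens
    return weights @ hidden, weights

def toxicity_probability(pooled, w_out, b_out):
    """Sigmoid output layer on top of the pooled vector."""
    return 1.0 / (1.0 + np.exp(-(pooled @ w_out + b_out)))

rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 8))        # 6 tokens, hidden size 8 (toy values)
mask = np.array([1, 1, 1, 1, 0, 0])     # last two positions are padding
pooled, weights = attention_pool(hidden, rng.normal(size=8), mask)
prob = toxicity_probability(pooled, rng.normal(size=8), 0.0)
```

In the first (original) variant, the pooled vector would simply be replaced by whatever summary the transformer already provides before the output layer.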

#### Data Augmentation
We performed data augmentation by adding transliterated data. We removed emojis from the text and then used uroman to generate additional transliterated data of 219,114 samples.
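
A minimal sketch of the emoji-removal step, assuming a regex over common emoji code-point ranges (the notes do not say how emojis were detected, so the pattern and the function name `remove_emojis` are illustrative). The transliteration itself is done by the external uroman tool and is not shown here.

```python
import re

# Assumption: these Unicode ranges cover the common emoji blocks.
# The zero-width joiner is deliberately NOT removed, because Indic
# scripts use it as a legitimate part of the text.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # pictographs, emoticons, transport, extended symbols
    "\U0001F1E6-\U0001F1FF"   # regional indicator (flag) letters
    "\u2600-\u27BF"           # miscellaneous symbols and dingbats
    "\uFE0F"                  # emoji variation selector
    "]+"
)

def remove_emojis(text: str) -> str:
    """Drop emoji characters and collapse the whitespace they leave behind."""
    without = EMOJI_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without).strip()

cleaned = remove_emojis("यह बहुत बुरा है 😡🔥 bro")
```

The cleaned strings would then be passed to uroman to produce the romanized copies that are appended to the training set.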

## A Survey of Toxic Comment Classification Methods

https://arxiv.org/pdf/2112.06412.pdf
### Preprocessing
Regex to implement text normalization\
Tokenizer class from Keras library to vectorize a text corpus\
Padding strategy with pad_sequences function from Keras
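
The three steps can be sketched without any dependencies. This is a plain-Python analogue, not the Keras API: `build_vocab` and `texts_to_sequences` roughly mirror what the Keras `Tokenizer` does (ids assigned by descending frequency, starting at 1), and `pad` mirrors the default behaviour of `pad_sequences` (zeros added at the front, truncation from the front). The specific normalization rules are assumptions.

```python
import re
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip URLs and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z0-9\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def build_vocab(texts, num_words=None):
    """Map words to ids by descending frequency, starting at 1 (0 is padding)."""
    counts = Counter(w for t in texts for w in t.split())
    ranked = [w for w, _ in counts.most_common(num_words)]
    return {w: i + 1 for i, w in enumerate(ranked)}

def texts_to_sequences(texts, vocab):
    """Replace each known word by its id; unknown words are dropped."""
    return [[vocab[w] for w in t.split() if w in vocab] for t in texts]

def pad(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` ids (maxlen >= 1)."""
    return [[0] * (maxlen - len(s[-maxlen:])) + s[-maxlen:] for s in sequences]

raw = ["You are SO dumb!!! http://x.co", "Thanks, you are great :)"]
clean = [normalize(t) for t in raw]
vocab = build_vocab(clean)
padded = pad(texts_to_sequences(clean, vocab), maxlen=5)
```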
### Architecture
CNN: preprocessing & feature engineering --> embedding layer --> convolutional layer --> feed-forward layer --> output layer\
LSTM: preprocessing & feature engineering --> embedding layer --> LSTM layer --> output layer

Reported results:\
Naive Bayes: 96.8 %\
LSTM (with FastText): 99.4 %\
LSTM (with GloVe): 99.6 %\
CNN (with FastText): 99.4 %\
CNN (with GloVe): 99.5 %
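
To make the convolutional pipeline concrete, here is a toy forward pass in NumPy. It is a sketch only: the global max pooling step, the layer sizes, and the random weights are assumptions and do not come from the survey.

```python
import numpy as np

def cnn_forward(token_ids, emb, kernels, bias, w_hidden, b_hidden, w_out, b_out):
    """Embedding -> 1D convolution + ReLU -> max pool -> feed-forward -> sigmoid."""
    x = emb[token_ids]                                # (seq_len, emb_dim)
    k = kernels.shape[1]                              # kernel width
    windows = np.stack([x[i:i + k] for i in range(len(token_ids) - k + 1)])
    # windows: (n_windows, k, emb_dim); kernels: (n_filters, k, emb_dim)
    conv = np.einsum("wke,fke->wf", windows, kernels) + bias
    feat = np.maximum(conv, 0.0).max(axis=0)          # global max pool: (n_filters,)
    hidden = np.maximum(feat @ w_hidden + b_hidden, 0.0)
    return 1.0 / (1.0 + np.exp(-(hidden @ w_out + b_out)))

rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 16))             # 50 word ids, 16-dim embeddings
params = dict(
    kernels=rng.normal(size=(4, 3, 16)) * 0.1, bias=np.zeros(4),
    w_hidden=rng.normal(size=(4, 8)) * 0.1, b_hidden=np.zeros(8),
    w_out=rng.normal(size=8) * 0.1, b_out=0.0,
)
prob = cnn_forward(np.array([0, 5, 12, 7, 3, 9]), emb, **params)
```

The LSTM pipeline differs only in the middle: the convolution and pooling are replaced by a recurrent layer whose final state feeds the output layer.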


## Predicting Different Types of Subtle Toxicity in Unhealthy Online Conversations
https://arxiv.org/pdf/2106.03952.pdf
### Architecture
A Convolutional Neural Network Long Short-Term Memory (CNN-LSTM) model with pre-trained word embeddings. We used Global Vectors for Word Representation (GloVe; Pennington et al., 2014b) to create an index of words mapped to known embeddings by parsing the data dump of pre-trained embeddings.
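
Parsing the GloVe dump into a word index could look like the sketch below. It assumes the standard GloVe text format (one word per line followed by its space-separated vector values); the helper names and the tiny in-memory sample are illustrative stand-ins for a real file.

```python
import io
import numpy as np

def load_glove(handle):
    """Parse GloVe's text format into a {word: vector} dictionary."""
    index = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        index[parts[0]] = np.asarray(parts[1:], dtype="float32")
    return index

def build_embedding_matrix(word_index, glove, dim):
    """Row i holds the vector of the word with id i; unknown words stay all-zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        if word in glove:
            matrix[i] = glove[word]
    return matrix

# Tiny in-memory stand-in for a real pre-trained embeddings file
sample = io.StringIO("the 0.1 0.2 0.3\nidiot -0.5 0.4 0.9\n")
glove = load_glove(sample)
matrix = build_embedding_matrix({"the": 1, "idiot": 2, "zzz": 3}, glove, dim=3)
```

The resulting matrix is what would typically be used to initialize the embedding layer in front of the CNN-LSTM.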

## UIT-ISE-NLP at SemEval-2021 Task 5: Toxic Spans Detection with BiLSTM-CRF and ToxicBERT Comment Classification
https://arxiv.org/pdf/2104.10100.pdf
### Architecture
**Detection model:** BiLSTM-CRF is a deep neural model used for Named-entity recognition task (Lample et al., 2016). We implement this model for the task of detecting toxic words in documents. The model includes three main layers: (1) The word representation layer uses embedding matrix from the GloVe word embedding, (2) The BiLSTM layer for sequence labeling, and (3) The Conditional Random Field (CRF) layer to control the probability of output labels.\
**Classification model:** The ToxicBERT model (Detoxify) is introduced by Hanu and Unitary team (2020) with the purpose to stop online abusive comments. It is a pre-trained model and is easy to use by using transformers library. The model is trained on the Toxic Comments Classification Challenge datasets provided by Jigsaw.\
Our system combines the detection and classification model together. The detection model (BiLSTM-CRF) returns the toxic spans from the post, while the classification model (ToxicBERT) classifies whether a post is toxic or non-toxic.
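
One way the two outputs might be combined is sketched below. The notes do not spell out the combination rule, so gating the spans on the post-level prediction is a hypothetical choice, as are the function names, the 0.5 threshold, and the assumption that spans are reported as character offsets.

```python
def tags_to_char_offsets(text, tokens, tags):
    """Return the character offsets covered by tokens tagged as toxic (tag == 1)."""
    offsets, cursor = [], 0
    for token, tag in zip(tokens, tags):
        start = text.index(token, cursor)     # locate the token in the raw post
        cursor = start + len(token)
        if tag == 1:
            offsets.extend(range(start, cursor))
    return offsets

def predict_spans(text, tokens, tags, toxic_probability, threshold=0.5):
    """Hypothetical rule: keep detected spans only if the post is classified toxic."""
    if toxic_probability < threshold:
        return []
    return tags_to_char_offsets(text, tokens, tags)

text = "you are a stupid idiot"
tokens = text.split()
tags = [0, 0, 0, 1, 1]            # stand-in for BiLSTM-CRF output
spans = predict_spans(text, tokens, tags, toxic_probability=0.93)
```

Here `tags` and `toxic_probability` stand in for the outputs of the BiLSTM-CRF and ToxicBERT models respectively.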

## Is preprocessing of text really worth your time for toxic comment classification?
https://arxiv.org/pdf/1806.02908.pdf

### Models
1) Logistic regression, which is conventionally used in sentiment classification
2) Naive Bayes with SVM (NBSVM)
3) Extreme Gradient Boosting (XGBoost)
4) FastText algorithm with Bidirectional LSTM (FastText-BiLSTM)

The F1-score for the negative class is around 0.8 for NBSVM and FastText-BiLSTM, around 0.74 for logistic regression (Logit), and around 0.57 for XGBoost. FastText-BiLSTM and NBSVM performed consistently well for most of the transformations compared to Logit and XGBoost.
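
For reference, a per-class F1-score can be computed as in this small sketch. The labels and predictions are made-up toy values, and which label counts as the "negative" class is an assumption passed in through `label`.

```python
def f1_for_class(y_true, y_pred, label):
    """F1 = harmonic mean of precision and recall for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]
score = f1_for_class(y_true, y_pred, label=0)
```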

## Empirical Analysis of Multi-Task Learning for Reducing Identity Bias in Toxic Comment Detection
https://arxiv.org/pdf/1909.09758.pdf

Embedding layer --> biLSTM layer --> attention layer (two fully connected layers)

## LONG RANGE ARENA: A BENCHMARK FOR EFFICIENT TRANSFORMERS
https://arxiv.org/pdf/2011.04006.pdf

**Results on Text Classification:** Byte-level classification is shown to be difficult and challenging especially when no pretraining or contextual embeddings are used. The best model only obtains 65.90 accuracy. The Linear Transformer performs well on this task, along with the Performer model. Contrary to the ListOps task, it seems like fast kernel-based models do well on this task.

# ****BiLSTM & Attention Layer****

## A Beginner’s Guide to Using Attention Layer in Neural Networks
https://analyticsindiamag.com/a-beginners-guide-to-using-attention-layer-in-neural-networks/

## Hands-On Guide to Bi-LSTM With Attention
https://analyticsindiamag.com/hands-on-guide-to-bi-lstm-with-attention/


## biLSTM for multilabel toxic comments
https://medium.com/analytics-vidhya/author-multi-class-text-classification-using-bidirectional-lstm-keras-c9a533a1cc4a

# ****Data Augmentation****

## Can We Achieve More with Less? Exploring Data Augmentation for Toxic Comment Classification
https://arxiv.org/pdf/2007.00875.pdf

## Text Augmentation for Neural Networks
http://ceur-ws.org/Vol-2268/paper11.pdf

# ****Benchmark model of Toxic comment****

## Kaggle 3rd Place Solution — Jigsaw Multilingual Toxic Comment Classification
https://towardsdatascience.com/kaggle-3rd-place-solution-jigsaw-multilingual-toxic-comment-classification-e36d7d194bfb

https://github.com/moizsaifee/kaggle-jigsaw-multilingual-toxic-comment-classification-3rd-place-solution

Multi-model ensemble (RoBERTa, RoBERTa++, RoBERTa MLM, Mono BERT) with blending (weighted average)
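
Blending by weighted average can be sketched in a few lines. The model names, probabilities, and weights below are made up for illustration; the actual weights used in the solution are not given in these notes.

```python
def blend(predictions, weights):
    """Weighted average of per-model probability lists (weights are normalized)."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)      # one row per example, across models
    ]

model_probs = [
    [0.90, 0.20, 0.60],   # hypothetical model A
    [0.80, 0.10, 0.40],   # hypothetical model B
    [0.70, 0.30, 0.50],   # hypothetical model C
]
blended = blend(model_probs, weights=[0.5, 0.3, 0.2])
```

In practice the weights would usually be tuned on a validation set rather than fixed by hand.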

## Video of the challenge winners:
https://www.youtube.com/watch?v=_-VeZU4JyBo

Basic Model:
* Word embeddings
* Two BiGRU layers
* Two dense layers
* Output

a. Diverse pre-trained embeddings (FastText & GloVe)\
b. Translation as train/**test** augmentation\
c. Pseudo-labeling (labeled the test data with the best ensemble, then trained on it)\
d. Robust CV + stacking framework (used a mix of arithmetic averaging and LightGBM)
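
The pseudo-labeling idea can be sketched as follows. The confidence thresholds `low` and `high` are assumptions (the notes do not say whether the winners filtered by confidence), and the texts and probabilities are toy values.

```python
def pseudo_label(test_texts, ensemble_probs, low=0.1, high=0.9):
    """Keep test examples the ensemble is confident about and attach hard labels."""
    labeled = []
    for text, p in zip(test_texts, ensemble_probs):
        if p >= high:
            labeled.append((text, 1))
        elif p <= low:
            labeled.append((text, 0))
    return labeled

train = [("have a nice day", 0), ("you idiot", 1)]
test_texts = ["total moron", "see you tomorrow", "that was odd"]
ensemble_probs = [0.97, 0.03, 0.55]     # stand-in for the best ensemble's output
augmented_train = train + pseudo_label(test_texts, ensemble_probs)
```

The models would then be retrained on `augmented_train`, which now includes the confidently predicted test examples.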

Also: train on translation...

https://www.meetup.com/LearnDataScience/events/248699439/