%%html 
<style>
    .gr-hide {
        display:block;
        font-style: italic;
        color:#ccc;
    }
    .gr-remind {
        color: red;
    }
</style>

# Udacity Machine Learning Engineering Capstone Project Proposal

Author: Giuseppe Romagnuolo<br/>
Field: Natural Language Processing<br/>
Source: Kaggle competition - [Jigsaw unintended bias in toxicity classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview)<br/>
Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.<br/>


---------------------


## Domain Background

<p class="gr-hide">
Student briefly details background information of the domain from which the project is proposed. Historical information relevant to the project should be included. It should be clear how or why a problem in the domain can or should be solved. Related academic research should be appropriately cited. A discussion of the student's personal motivation for investigating a particular problem in the domain is encouraged but not required.
</p>

Natural Language Understanding is a complex subject that is hypothesised to belong to the set of AI-complete problems: problems that cannot be solved with modern computer technology alone and would also require human computation.[$^{[ref.Wikipedia]}$](https://en.wikipedia.org/wiki/AI-complete)

With over 90% of all data ever generated having been produced in the last two years[$^{[ref.ScienceDaily]}$](https://www.sciencedaily.com/releases/2013/05/130522085217.htm), and with a large proportion of it being human-generated text, there is an increasing need to manage this data automatically.

The UK Government's recent proposal to regulate social media companies over harmful content, including "substantial" fines and the ability to block services that do not stick to the rules, is one example of the need to better manage user-generated content. [$^{[ref.BBC]}$](https://www.bbc.co.uk/news/technology-47135058)

Other initiatives, such as [Riot Games](https://www.riotgames.com/en)'s work to predict and reform toxic player behaviour during games[$^{[ref.ArsTechnica]}$](https://arstechnica.com/gaming/2013/05/using-science-to-reform-toxic-player-behavior-in-league-of-legends/), are further examples of this effort to understand and moderate toxic content.

However, machine-only classification can suffer from unintended bias: models may predict a high likelihood of toxicity for content containing certain words (e.g. "gay") even when the comments are not actually toxic (such as "I am a gay woman"). This leaves machine-only classification models sub-standard and, as a consequence, risks penalising users and restricting freedom of speech.

Recent advances in NLP such as [BERT](https://arxiv.org/pdf/1810.04805.pdf) and [Sequence classification with human attention](https://aclweb.org/anthology/K18-1030) have improved machine language understanding and show promising results that could help create better algorithms in the natural language understanding domain.


## Problem Statement

<p class="gr-hide">
Student clearly describes the problem that is to be solved. The problem is well defined and has at least one relevant potential solution. Additionally, the problem is quantifiable, measurable, and replicable.
</p>

The Conversation AI team, a research initiative founded by Jigsaw and Google (both part of Alphabet), builds technology to protect voices in conversation. A main area of focus is machine learning models that can identify toxicity in online conversations, where toxicity is defined as anything rude, disrespectful or otherwise likely to make someone leave a discussion.

Last year, in the Toxic Comment Classification Challenge, participants built multi-headed models to recognize toxicity and several subtypes of toxicity. This year's competition is a related challenge: building toxicity models that operate fairly across a diverse range of conversations.

Here’s the background: When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happens because training data was pulled from available sources where unfortunately, certain identities are overwhelmingly referred to in offensive ways. Training a model from data with these imbalances risks simply mirroring those biases back to users.

In this competition, you're challenged to build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities. You'll be using a dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias. Develop strategies to reduce unintended bias in machine learning models, and you'll help the Conversation AI team, and the entire industry, build models that work well for a wide range of conversations.[$^{[cit.Kaggle]}$](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview)

## Datasets and Inputs

<p class="gr-hide">
The dataset(s) and/or input(s) to be used in the project are thoroughly described. Information such as how the dataset or input is (was) obtained, and the characteristics of the dataset or input, should be included. It should be clear how the dataset(s) or input(s) will be used in the project and whether their use is appropriate given the context of the problem.
</p>



## Solution Statement

<p class="gr-hide">
Student clearly describes a solution to the problem. The solution is applicable to the project domain and appropriate for the dataset(s) or input(s) given. Additionally, the solution is quantifiable, measurable, and replicable.
</p>



To overcome unintended model bias I will explore building an ensemble of deep neural network classifiers, where each classifier tries to maximise its score on comments for one identity.



## Benchmark Model

<p class="gr-hide">
A benchmark model is provided that relates to the domain, problem statement, and intended solution. Ideally, the student's benchmark model provides context for existing methods or known information in the domain and problem given, which can then be objectively compared to the student's solution. The benchmark model is clearly defined and measurable.
</p>

## Evaluation Metrics

<p class="gr-hide">
Student proposes at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model presented. The evaluation metric(s) proposed are appropriate given the context of the data, the problem statement, and the intended solution.
</p>



This competition uses a newly developed metric that combines several submetrics to balance overall performance with various aspects of unintended bias.

Please refer to the [evaluation section](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation) of the competition and the provided [benchmark kernel](https://www.kaggle.com/dborkan/benchmark-kernel), which includes code to calculate the competition evaluation metrics.

The submetrics are defined as follows:

### Overall AUC
This is the ROC-AUC for the full evaluation set.


### Bias AUCs

To measure unintended bias, we again calculate the ROC-AUC, this time on three specific subsets of the test set for each identity, each capturing a different aspect of unintended bias. You can learn more about these metrics in Conversation AI's recent paper *[Nuanced Metrics for Measuring Unintended Bias with Real Data in Text Classification](https://arxiv.org/abs/1903.04561)*.

**Subgroup AUC:** Here, we restrict the data set to only the examples that mention the specific identity subgroup. *A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity*.

**BPSN (Background Positive, Subgroup Negative) AUC:** Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. *A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not*, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

**BNSP (Background Negative, Subgroup Positive) AUC:** Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. *A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not*, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.
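The three subsets described above can be made concrete with a small sketch. It is not the benchmark kernel's code: the function names `roc_auc` and `bias_aucs` and the toy data are my own assumptions, and the AUC is computed with a plain pairwise comparison that is only practical for small examples.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Ties count as half a win (Mann-Whitney formulation).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_aucs(toxic, in_subgroup, scores):
    """Subgroup, BPSN and BNSP AUC for one identity (hypothetical helper)."""
    rows = list(zip(toxic, in_subgroup, scores))

    def auc_where(keep):
        kept = [(y, s) for y, g, s in rows if keep(y, g)]
        return roc_auc([y for y, _ in kept], [s for _, s in kept])

    return {
        # Only comments that mention the identity.
        "subgroup": auc_where(lambda y, g: g),
        # Non-toxic comments that mention it + toxic comments that do not.
        "bpsn": auc_where(lambda y, g: (g and not y) or (not g and y)),
        # Toxic comments that mention it + non-toxic comments that do not.
        "bnsp": auc_where(lambda y, g: (g and y) or (not g and not y)),
    }


# Toy data: the model over-scores one non-toxic comment that mentions the identity.
toxic = [1, 0, 0, 1, 0, 1]
in_subgroup = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.2, 0.7, 0.1, 0.6]

result = bias_aucs(toxic, in_subgroup, scores)
print(result)  # {'subgroup': 1.0, 'bpsn': 0.5, 'bnsp': 1.0}
```

In this toy case only the BPSN AUC drops, which matches the interpretation given above: a non-toxic comment mentioning the identity is scored higher than toxic comments that do not mention it.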

### Generalized Mean of Bias AUCs

To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:


$M_p(m_s)=( \frac{1}{N} \sum_{s=1}^{N}m_{s}^{p} )^{\frac{1}{p}}$

where:

- $M_p$ = the $p$th power-mean function
- $m_s$ = the bias metric $m$ calculated for subgroup $s$
- $N$ = number of identity subgroups

For this competition, we use a $p$ value of -5 to encourage competitors to improve the model for the identity subgroups with the lowest model performance.
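To see why a negative exponent rewards improving the weakest subgroup, here is a minimal sketch of the generalized mean. The function name `power_mean` and the example AUC values are assumptions made for illustration only.

```python
def power_mean(values, p=-5):
    """Generalized (power) mean; with p < 0 it is dominated by the smallest values."""
    values = list(values)
    # Note: with a negative p, a value of exactly 0 would raise ZeroDivisionError.
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


# Hypothetical per-identity AUCs: two subgroups do well, one does poorly.
aucs = [0.9, 0.9, 0.6]
arithmetic = sum(aucs) / len(aucs)
generalized = power_mean(aucs, p=-5)
print(arithmetic, generalized)
```

With these values the generalized mean comes out noticeably below the arithmetic mean, because the weakest subgroup pulls it down.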

### Final Metric

We combine the overall AUC with the generalized mean of the Bias AUCs to calculate the final model score:

$score=w_0AUC_{overall}+\sum_{a=1}^{A}w_aM_p(m_{s,a})$

where:

- $A$ = number of submetrics (3)
- $m_{s,a}$ = bias metric for identity subgroup $s$ using submetric $a$
- $w_a$ = a weighting for the relative importance of each submetric; all four $w$ values set to 0.25

While the leaderboard will be determined by this single number, we highly recommend looking at the individual submetric results, [as shown in this kernel](https://www.kaggle.com/dborkan/benchmark-kernel), to guide you as you develop your models.
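As a sanity check on the formula, the final score can be sketched in a few lines. This is my own illustration rather than the competition's scoring code: `final_score`, the dictionary layout and the AUC values are all assumptions.

```python
def power_mean(values, p=-5):
    """Generalized mean of a list of per-identity AUCs."""
    values = list(values)
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


def final_score(overall_auc, bias_aucs_by_submetric, p=-5, weight=0.25):
    """Overall AUC plus the power mean of each bias submetric, all equally weighted."""
    score = weight * overall_auc
    for per_identity_aucs in bias_aucs_by_submetric.values():
        score += weight * power_mean(per_identity_aucs, p)
    return score


# Hypothetical results for two identities.
example = final_score(
    overall_auc=0.95,
    bias_aucs_by_submetric={
        "subgroup": [0.9, 0.9],
        "bpsn": [0.9, 0.9],
        "bnsp": [0.9, 0.9],
    },
)
print(example)  # about 0.9125 = 0.25 * 0.95 + 3 * 0.25 * 0.9
```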

---
## Project Design


<p class="gr-hide">
Student summarizes a theoretical workflow for approaching a solution given the problem. Discussion is made as to what strategies may be employed, what analysis of the data might be required, or which algorithms will be considered. The workflow and discussion provided align with the qualities of the project. Small visualizations, pseudocode, or diagrams are encouraged but not required.
</p>

These are the steps I plan to take to build a solid classification algorithm:

1. Exploring data and collecting key metrics
2. Definition of evaluation metrics
3. Transformation
4. Model selection and building
5. Training, evaluation and fine tuning the model
6. Kaggle submission

### 1. Exploring data and collecting key metrics

First I will explore the dataset. During this phase I will gather statistics such as:

- Number of samples in the training set
- Distribution of toxic/non-toxic in the training set
- Distribution of toxic/non-toxic in the training set for each identity
- Number of samples with no identity set
- Distribution of toxic/non-toxic in the training set for each sub-group
- Number of words per sample
- Frequency distribution of words
- Distribution of sample length
- Most frequent words
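A few of the statistics listed above could be gathered along these lines. This is only a sketch on made-up in-memory samples: the field names `text`, `toxic` and `identities` are hypothetical and do not reflect the actual column names of the competition files.

```python
from collections import Counter, defaultdict
from statistics import mean

# Made-up samples standing in for the training set.
samples = [
    {"text": "I am a gay woman", "toxic": False, "identities": ["gay", "female"]},
    {"text": "you are all idiots", "toxic": True, "identities": []},
    {"text": "thanks for the thoughtful article", "toxic": False, "identities": []},
    {"text": "women are idiots", "toxic": True, "identities": ["female"]},
]


def summarise(samples):
    """Collect basic counts: class balance, per-identity balance, lengths, vocabulary."""
    label_counts = Counter(s["toxic"] for s in samples)
    per_identity = defaultdict(Counter)
    for s in samples:
        for identity in s["identities"]:
            per_identity[identity][s["toxic"]] += 1
    lengths = [len(s["text"].split()) for s in samples]
    words = Counter(w.lower() for s in samples for w in s["text"].split())
    return {
        "n_samples": len(samples),
        "toxic_share": label_counts[True] / len(samples),
        "per_identity": {k: dict(v) for k, v in per_identity.items()},
        "n_without_identity": sum(1 for s in samples if not s["identities"]),
        "mean_words": mean(lengths),
        "top_words": words.most_common(3),
    }


stats = summarise(samples)
print(stats)
```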

### 2. Definition of evaluation metrics

It is important to define the method of evaluation from the very beginning, and the Kaggle competition specifies the evaluation metrics it uses.

However, I want to be able to calculate the score offline, without the need to submit the code, so I will set up code that computes the evaluation metrics locally.

### 3. Transformation

The dataset will need to be transformed before it can be fed to a model. This requires removing stopwords, identifying punctuation (paying attention to punctuation used to obfuscate offensive words), lemmatisation and tokenisation. Lastly, I will transform words into embeddings, evaluating techniques such as **Word2Vec**, **GloVe** and **FastText**.
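A minimal sketch of the tokenisation and stopword-removal part of this step might look as follows. The stopword list is a tiny illustrative one and the regular expression is my own assumption; lemmatisation is left out, and a real pipeline would more likely rely on an NLP library.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "are", "am", "i", "you", "to", "of"}

# A word (optionally with an apostrophe) or a run of symbols.
# Symbol runs are kept as tokens so that obfuscation such as "$#!*" is not lost.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?|[^\sa-z0-9]+")


def preprocess(text):
    """Lower-case, tokenise, and drop stopwords."""
    tokens = TOKEN.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("You are a $#!* person!!"))  # ['$#!*', 'person', '!!']
```

One limitation of this sketch is that a symbol placed inside a word (for example "id!ot") splits it into several tokens, so obfuscated words would still need dedicated handling.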

One thing to note is that Word2Vec and GloVe only learn vectors for complete words found in the training corpus, which means that words that are misspelled or camouflaged with special characters, like `$#!*`, won't be assigned a vector.

FastText, on the other hand, breaks words into several character n-grams (sub-words). For instance, the tri-grams for the word apple are app, ppl, and ple. Rare words can now be properly represented, since it is highly likely that some of their n-grams also appear in other words. [$^{[cit.Medium]}$](https://medium.com/@jatinmandav3/opinion-mining-sometimes-known-as-sentiment-analysis-or-emotion-ai-refers-to-the-use-of-natural-874f369194c0)
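The sub-word idea can be illustrated with a short sketch. `char_ngrams` is a hypothetical helper, not FastText's API. The optional `<` and `>` boundary markers mimic the way FastText marks the start and end of a word, which adds two extra tri-grams to the apple example.

```python
def char_ngrams(word, n=3, boundaries=False):
    """Return the character n-grams of a word, optionally with boundary markers."""
    token = f"<{word}>" if boundaries else word
    return [token[i:i + n] for i in range(len(token) - n + 1)]


print(char_ngrams("apple"))                   # ['app', 'ppl', 'ple']
print(char_ngrams("apple", boundaries=True))  # ['<ap', 'app', 'ppl', 'ple', 'le>']

# A word missing from the vocabulary still shares n-grams with known words.
shared = set(char_ngrams("apple")) & set(char_ngrams("apples"))
print(sorted(shared))                         # ['app', 'ple', 'ppl']
```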


### 4. Model selection and building

An [earlier competition](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge) from Jigsaw saw participants classify toxic/non-toxic comments; however, the models built then fell short when it came to eliminating unintended bias, e.g. they gave a high likelihood of toxicity for comments containing certain identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happened because training data was pulled from available sources where certain identities are overwhelmingly referred to in offensive ways.[$^{[cit.Kaggle]}$](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification)

I will have to minimise this bias and find a strategy that classifies samples belonging to [different identities](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) with the same high level of accuracy.

This might require evaluating an **ensemble** in which each classifier is optimised for classifying comments about a particular identity. The likelihood of a comment being toxic would then be a weighted average of the individual classifiers' scores, where the weights change based on the identity of the message.

One challenge is that not all data in the training set has been tagged with identities. Therefore, in order to adjust the weights of the different classifiers depending on the comment's identity, I will need to find a way to predict the identity a comment is most likely to have.

This may call for a pipeline of classifiers: a first classifier predicts the identity of each comment, and this prediction becomes a feature for the ensemble of classifiers, used to weight their scores according to the predicted identity of the comment.
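The weighting idea can be sketched as follows, assuming the per-identity classifiers and the identity predictor already exist and have produced their outputs. The function `ensemble_toxicity`, the `background` fallback classifier and all the numbers are hypothetical; this is one possible way to combine the scores, not a settled design.

```python
def ensemble_toxicity(scores, identity_weights, fallback="background"):
    """Weighted average of per-identity toxicity scores.

    scores:           toxicity score from each classifier, keyed by identity
    identity_weights: predicted probability that the comment concerns each identity
    fallback:         classifier to use when no identity is predicted at all
    """
    total = sum(identity_weights.get(name, 0.0) for name in scores)
    if total == 0:
        return scores[fallback]
    weighted = sum(identity_weights.get(name, 0.0) * s for name, s in scores.items())
    return weighted / total


# Hypothetical outputs for the comment "I am a gay woman".
scores = {"female": 0.2, "gay": 0.1, "background": 0.7}
identity_weights = {"female": 0.5, "gay": 0.5}

print(ensemble_toxicity(scores, identity_weights))  # close to 0.15
print(ensemble_toxicity(scores, {}))                # 0.7, falls back to the background model
```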

Several models are suitable for classification, and during this phase I will have to decide which ML algorithm to use. Deep neural networks such as CNNs and LSTMs are among those I am considering.

**Convolutional neural networks** are suitable when data has positional information.

**Long Short-Term Memory neural networks** are a kind of RNN. Like other RNNs they are designed for sequential data, where the current step has some kind of relation with the previous steps, and they are additionally designed to remember things in the long term. [$^{[cit.Quora]}$](https://www.quora.com/Where-does-each-type-of-neural-network-RNN-CNN-LSTM-etc-excel)

While sentiment analysis might not need information about the position of words, I believe that taking word position into account will improve accuracy across all the [different identities](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) and will help with unintended bias (e.g. "is an offensive word being used to quote someone else's comment?" or "is the tone demeaning without the use of any particular offensive word?").


### 5. Training, evaluation and fine tuning the model

Having defined a suitable model, probably the hardest part will be evaluating and fine-tuning it.
Techniques like grid search can help choose more suitable hyperparameters; however, this process is painstakingly long.
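The cost of grid search is easy to see in a small sketch: every combination of values is evaluated, so the number of training runs is the product of the grid sizes. The helper `grid_search`, the hyperparameter names and the `toy_evaluate` function (a stand-in for training and validating a model) are all assumptions made for illustration.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(params):
    """Stand-in for 'train the model and return its validation score'."""
    return -(params["learning_rate"] - 0.01) ** 2 - (params["dropout"] - 0.3) ** 2


# 3 x 3 values means 9 full training runs with a real model.
grid = {"learning_rate": [0.001, 0.01, 0.1], "dropout": [0.1, 0.3, 0.5]}
best_params, best_score = grid_search(grid, toy_evaluate)
print(best_params)  # {'learning_rate': 0.01, 'dropout': 0.3}
```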

### 6. Kaggle submission

It is tempting to submit to Kaggle frequently to evaluate the performance against Jigsaw's scoring and optimise the model against this score.

There is evidence that this is not a winning strategy: repeated submissions risk slowly overfitting the model to this test set, so that it becomes a good predictor of the test set but fails to generalise. [$^{[ref.Kaggle]}$](http://blog.kaggle.com/2012/07/06/the-dangers-of-overfitting-psychopathy-post-mortem/)

As such I shall refrain from making too many submissions or becoming too fixated on the Kaggle leaderboard score.