%%html 
<style>
    .gr-hide {
        display:block;
        font-style: italic;
        color:#ccc;
    }
    .gr-remind {
        color: red;
    }
</style>

# Udacity Machine Learning Engineering Capstone Project Proposal

Author: Giuseppe Romagnuolo

Kaggle competition: [Jigsaw unintended bias in toxicity classification](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview)

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.


---------------------


## Domain Background

<p class="gr-hide">
Student briefly details background information of the domain from which the project is proposed. Historical information relevant to the project should be included. It should be clear how or why a problem in the domain can or should be solved. Related academic research should be appropriately cited. A discussion of the student's personal motivation for investigating a particular problem in the domain is encouraged but not required.
</p>

Natural Language Understanding is a complex subject that is hypothesised to belong to the AI-complete set of problems and, as such, cannot be solved with modern computer technology alone but would also require human computation.[$^{[ref.Wikipedia]}$](https://en.wikipedia.org/wiki/AI-complete)

Riot Games uses Databricks and machine learning to identify hate speech during games:
https://arstechnica.com/gaming/2013/05/using-science-to-reform-toxic-player-behavior-in-league-of-legends/2/

https://developers.google.com/machine-learning/crash-course/

## Problem Statement

<p class="gr-hide">
Student clearly describes the problem that is to be solved. The problem is well defined and has at least one relevant potential solution. Additionally, the problem is quantifiable, measurable, and replicable.
</p>

When the Conversation AI team first built toxicity models, they found that the models incorrectly learned to associate the names of frequently attacked identities with toxicity. Models predicted a high likelihood of toxicity for comments containing those identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happens because training data was pulled from available sources where unfortunately, certain identities are overwhelmingly referred to in offensive ways.[$^{[ref.Kaggle]}$](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview)

## Datasets and Inputs

<p class="gr-hide">
The dataset(s) and/or input(s) to be used in the project are thoroughly described. Information such as how the dataset or input is (was) obtained, and the characteristics of the dataset or input, should be included. It should be clear how the dataset(s) or input(s) will be used in the project and whether their use is appropriate given the context of the problem.
</p>

## Solution Statement

<p class="gr-hide">
Student clearly describes a solution to the problem. The solution is applicable to the project domain and appropriate for the dataset(s) or input(s) given. Additionally, the solution is quantifiable, measurable, and replicable.
</p>



To overcome unintended model bias I will explore building an ensemble of deep neural network classifiers, where each one tries to maximise its score within its own identity.



## Benchmark Model

<p class="gr-hide">
A benchmark model is provided that relates to the domain, problem statement, and intended solution. Ideally, the student's benchmark model provides context for existing methods or known information in the domain and problem given, which can then be objectively compared to the student's solution. The benchmark model is clearly defined and measurable.
</p>

## Evaluation Metrics

<p class="gr-hide">
Student proposes at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model presented. The evaluation metric(s) proposed are appropriate given the context of the data, the problem statement, and the intended solution.
</p>



This competition uses a newly developed metric that combines several submetrics to balance overall performance with various aspects of unintended bias.

Please refer to the [evaluation section](https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation) of the competition and the provided [benchmark kernel](https://www.kaggle.com/dborkan/benchmark-kernel), which includes code to calculate the competition evaluation metrics.

The submetrics are defined as follows:

### Overall AUC
This is the ROC-AUC for the full evaluation set.


### Bias AUCs

To measure unintended bias, we again calculate the ROC-AUC, this time on three specific subsets of the test set for each identity, each capturing a different aspect of unintended bias. You can learn more about these metrics in Conversation AI's recent paper *[Nuanced Metrics for Measuring Unintended Bias with Real Data in Text Classification](https://arxiv.org/abs/1903.04561)*.

**Subgroup AUC:** Here, we restrict the data set to only the examples that mention the specific identity subgroup. *A low value in this metric means the model does a poor job of distinguishing between toxic and non-toxic comments that mention the identity*.

**BPSN (Background Positive, Subgroup Negative) AUC:** Here, we restrict the test set to the non-toxic examples that mention the identity and the toxic examples that do not. *A low value in this metric means that the model confuses non-toxic examples that mention the identity with toxic examples that do not*, likely meaning that the model predicts higher toxicity scores than it should for non-toxic examples mentioning the identity.

**BNSP (Background Negative, Subgroup Positive) AUC:** Here, we restrict the test set to the toxic examples that mention the identity and the non-toxic examples that do not. *A low value here means that the model confuses toxic examples that mention the identity with non-toxic examples that do not*, likely meaning that the model predicts lower toxicity scores than it should for toxic examples mentioning the identity.
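The three subsets described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the competition's reference code: the function names `roc_auc` and `bias_aucs`, the boolean `in_subgroup` flag and the toy data are all assumptions made for the example, and the AUC is computed by brute-force pairwise comparison, which is fine for small samples but slow on the full dataset.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        return float("nan")  # AUC is undefined without both classes
    # Count a win for each (positive, negative) pair ranked correctly, half for ties.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bias_aucs(labels, scores, in_subgroup):
    """Return the subgroup, BPSN and BNSP AUCs for a single identity."""
    rows = list(zip(labels, scores, in_subgroup))

    def auc_where(keep):
        kept = [(y, s) for y, s, g in rows if keep(y, g)]
        return roc_auc([y for y, _ in kept], [s for _, s in kept])

    return {
        # Only examples that mention the identity.
        "subgroup": auc_where(lambda y, g: g),
        # Non-toxic examples that mention the identity + toxic examples that do not.
        "bpsn": auc_where(lambda y, g: (g and not y) or (not g and y)),
        # Toxic examples that mention the identity + non-toxic examples that do not.
        "bnsp": auc_where(lambda y, g: (g and y) or (not g and not y)),
    }


# Toy data (hypothetical): 1 = toxic, 0 = non-toxic.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
in_subgroup = [True, True, True, False, False, False]

result = bias_aucs(labels, scores, in_subgroup)
print(result)  # {'subgroup': 0.5, 'bpsn': 0.0, 'bnsp': 1.0}
```

In this toy example the non-toxic comment that mentions the identity is scored at 0.8, higher than the toxic background comment at 0.6, which is what drives the BPSN AUC down to 0.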

### Generalized Mean of Bias AUCs

To combine the per-identity Bias AUCs into one overall measure, we calculate their generalized mean as defined below:


$M_p(m_s)=( \frac{1}{N} \sum_{s=1}^{N}m_{s}^{p} )^{\frac{1}{p}}$

where:

- $M_p$ = the $p$th power-mean function
- $m_s$ = the bias metric $m$ calculated for subgroup $s$
- $N$ = number of identity subgroups

For this competition, we use a $p$ value of -5 to encourage competitors to improve the model for the identity subgroups with the lowest model performance.
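To see why a negative exponent rewards improving the weakest subgroups, the generalized mean can be evaluated on a small made-up set of AUC values. The function name `power_mean` and the sample values are assumptions for illustration only.

```python
def power_mean(values, p=-5):
    """Generalized (power) mean of positive values for a non-zero exponent p."""
    if p == 0:
        raise ValueError("p must be non-zero")
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


# Hypothetical per-identity AUCs: two strong subgroups and one weak one.
aucs = [0.9, 0.9, 0.5]

arithmetic = power_mean(aucs, p=1)   # ordinary average, roughly 0.77
penalised = power_mean(aucs, p=-5)   # roughly 0.61, pulled towards the weakest value
print(arithmetic, penalised)
```

With $p = -5$ the single weak subgroup dominates the result, so raising the 0.5 improves the score far more than raising either 0.9.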

### Final Metric

We combine the overall AUC with the generalized mean of the Bias AUCs to calculate the final model score:

$score=w_0AUC_{overall}+\sum_{a=1}^{A}w_aM_p(m_{s,a})$

where:

- $A$ = number of submetrics (3)
- $m_{s,a}$ = bias metric for identity subgroup $s$ using submetric $a$
- $w_a$ = a weighting for the relative importance of each submetric; all four $w$ values set to 0.25

While the leaderboard will be determined by this single number, we highly recommend looking at the individual submetric results, [as shown in this kernel](https://www.kaggle.com/dborkan/benchmark-kernel), to guide you as you develop your models.
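As a rough sketch of how the final score could be assembled offline, the snippet below combines an overall AUC with the power mean of each bias submetric, using equal weights of 0.25. The names `power_mean` and `final_score`, the dictionary layout and the numbers are assumptions for the example; the benchmark kernel remains the reference implementation.

```python
def power_mean(values, p=-5):
    """Generalized (power) mean of positive values for a non-zero exponent p."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


def final_score(overall_auc, bias_aucs_by_submetric, p=-5, weight=0.25):
    """Weighted sum of the overall AUC and the power mean of each bias submetric.

    bias_aucs_by_submetric maps a submetric name (subgroup, bpsn, bnsp)
    to the list of per-identity AUCs for that submetric.
    """
    bias_terms = [power_mean(aucs, p) for aucs in bias_aucs_by_submetric.values()]
    return weight * overall_auc + weight * sum(bias_terms)


# Hypothetical values for two identities.
example = {
    "subgroup": [0.85, 0.80],
    "bpsn": [0.90, 0.70],
    "bnsp": [0.95, 0.92],
}
score = final_score(0.93, example)
print(round(score, 4))
```

Because each of the four terms carries the same weight, a model whose overall AUC and per-identity AUCs all equal 0.8 would score exactly 0.8.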

## Project Design

<p class="gr-hide">
Student summarizes a theoretical workflow for approaching a solution given the problem. Discussion is made as to what strategies may be employed, what analysis of the data might be required, or which algorithms will be considered. The workflow and discussion provided align with the qualities of the project. Small visualizations, pseudocode, or diagrams are encouraged but not required.
</p>

**1. Exploring data and collecting key metrics**

First and foremost I will explore the dataset. During this phase I will gather statistics such as:

- Number of samples in the training set
- Distribution of toxic/non-toxic in the training set
- Distribution of toxic/non-toxic in the training set for each identity
- Number of samples with no identity set
- Distribution of toxic/non-toxic in the training set for each sub-group
- Number of words per sample
- Frequency distribution of words
- Distribution of sample length
- Most frequent words
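A few of these statistics can be sketched with the standard library alone. The field names `comment_text` and `target`, the 0.5 toxicity threshold and the three sample rows are assumptions made for the example; the real exploration would run over the competition's training file.

```python
from collections import Counter
from statistics import mean

# Hypothetical rows standing in for the training set.
samples = [
    {"comment_text": "thank you for the thoughtful article", "target": 0.0},
    {"comment_text": "you are an idiot", "target": 0.9},
    {"comment_text": "great point and well said", "target": 0.1},
]


def summarise(samples, threshold=0.5):
    """Collect basic counts: size, class balance, length and word frequency."""
    word_lists = [s["comment_text"].lower().split() for s in samples]
    toxic = sum(s["target"] >= threshold for s in samples)
    word_counts = Counter(w for words in word_lists for w in words)
    return {
        "n_samples": len(samples),
        "toxic_fraction": toxic / len(samples),
        "mean_words_per_sample": mean(len(words) for words in word_lists),
        "most_common_words": word_counts.most_common(3),
    }


stats = summarise(samples)
print(stats)
```

The same function could be applied to the rows of each identity in turn to obtain the per-identity distributions.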

**2. Definition of evaluation metrics**

It is important to define a method of evaluation from the very beginning, and the Kaggle competition specifies the evaluation metrics it uses.

However, I want to be able to calculate the score offline without having to submit the code, so I will set up code that calculates the evaluation metrics locally.

**3. Transformation**

The dataset will need to be transformed before it can be submitted to a model. This requires the removal of stopwords, the identification of punctuation (with consideration for punctuation used to obfuscate offensive words) and the tokenisation of comments. Lastly, I will transform words into embeddings, evaluating libraries like Word2Vec, GloVe and FastText.
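One possible shape for the tokenisation step is sketched below. It is a minimal sketch under stated assumptions: the stopword set is a tiny illustrative one, the set of masking symbols (`@$#!*`) is a guess, and the regular expression only keeps symbols that sit between letters, so a word masked at its first or last character is not handled.

```python
import re

# Illustrative stopword list only; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "you", "of", "to"}

# A token is a run of letters, digits or apostrophes, optionally joined by
# masking symbols, so "st*pid" stays whole while trailing "!!" is dropped.
TOKEN_RE = re.compile(r"[a-z0-9']+(?:[@$#!*]+[a-z0-9']+)*")


def tokenise(comment):
    """Lowercase, split into tokens and drop stopwords."""
    tokens = TOKEN_RE.findall(comment.lower())
    return [t for t in tokens if t not in STOPWORDS]


tokens = tokenise("You are a st*pid id!ot!!")
print(tokens)  # ['st*pid', 'id!ot']
```

Keeping the obfuscated tokens intact leaves the decision of how to represent them to the embedding step.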

**4. Model selection and building**

Several models are suitable for classification, and during this phase I will evaluate a number of them before selecting one.

A previous competition <span class="gr-remind">Link required</span> from Jigsaw saw participants classify toxic/non-toxic comments. However, the models built then fell short when it came to eliminating unintended bias, e.g. they gave a high likelihood of toxicity for comments containing certain identities (e.g. "gay"), even when those comments were not actually toxic (such as "I am a gay woman"). This happened because training data was pulled from available sources where certain identities are overwhelmingly referred to in offensive ways.

I will have to minimise this bias and find a strategy that classifies correctly regardless of the skew in the data.

This might require evaluating the use of an ensemble of classifiers, each optimised for the classification of a particular identity. I believe it would be best to calculate the likelihood of a comment being toxic as a weighted average of the individual classifiers, where the weights change based on the identity.

However, not all comments in the training set, nor those in the test set, will have been tagged with identities. Therefore, in order to adjust the weights of the different classifiers depending on the comment's identity, I will need to find a way to predict the comment's most likely identity.

This suggests that I will need to build a pipeline of classifiers: a first classifier predicts the identity of the comment, the comment is then submitted to the ensemble of classifiers, and their scores are weighted based on the predicted identity of the comment.
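The pipeline idea can be sketched as follows. Everything here is hypothetical: the function `ensemble_score`, the `__general__` fallback classifier for comments with no detected identity, and the stub lambdas standing in for trained models are assumptions for illustration, not a settled design.

```python
def ensemble_score(comment, predict_identities, identity_classifiers, general_classifier):
    """Weight each identity classifier's toxicity score by the predicted identity.

    predict_identities(comment) returns {identity: probability}. Any probability
    mass not assigned to an identity goes to the general classifier.
    """
    weights = dict(predict_identities(comment))
    weights["__general__"] = max(0.0, 1.0 - sum(weights.values()))
    total = sum(weights.values())  # normalise in case identities overlap

    scores = {name: clf(comment) for name, clf in identity_classifiers.items()}
    scores["__general__"] = general_classifier(comment)
    return sum(w * scores[name] for name, w in weights.items()) / total


# Stub models standing in for trained classifiers (hypothetical outputs).
predict_identities = lambda comment: {"gay": 0.75}
identity_classifiers = {"gay": lambda comment: 0.25}
general_classifier = lambda comment: 0.5

toxicity = ensemble_score(
    "I am a gay woman", predict_identities, identity_classifiers, general_classifier
)
print(toxicity)  # 0.3125
```

Using soft weights rather than picking the single most likely identity means an error by the identity classifier degrades the score gradually instead of routing the comment to the wrong specialist outright.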

**5. Train and evaluate the model**

At this point I will train the models and evaluate them against the metrics defined above.


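FastText: Word2Vec and GloVe only learn vectors for complete words found in the training corpus, which means that words that are misspelled or camouflaged using special characters like `$#!*` won't be assigned a vector. FastText, by contrast, builds word vectors from character n-grams and can therefore produce a vector for such out-of-vocabulary words.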
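Subword models such as FastText represent a word as a bag of character n-grams, so a camouflaged word still shares pieces with its plain spelling. The sketch below only illustrates that overlap; it does not call the FastText library, and the function name `char_ngrams` and the n-gram range of 3 to 6 are assumptions chosen for the example.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of a word padded with < and >."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }


plain = char_ngrams("stupid")
masked = char_ngrams("st*pid")
shared = plain & masked
print(sorted(shared))  # ['<st', 'id>', 'pid', 'pid>']
```

A whole-word vocabulary would treat `st*pid` as an unknown token, whereas the shared n-grams give a subword model something to build a vector from.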

## Presentation

<p class="gr-hide">
Proposal follows a well-organized structure and would be readily understood by its intended audience. Each section is written in a clear, concise and specific manner. Few grammatical and spelling mistakes are present. All resources used and referenced are properly cited.
</p>