Buyl et al. propose that focusing on algorithmic fairness is a more efficient way to train models on biased data. The authors focus on network embeddings, which are similar to the word embeddings commonly used in Natural Language Processing. Chang et al. take a deep dive into how network embeddings work for biomedical data. In brief, each biomedical object (e.g. a drug or a protein) is represented as a node in a network, and each node is in turn mapped to a point in a multi-dimensional space. Similar nodes are mapped to nearby points, which is very useful for classification and clustering.
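To make the "similar nodes end up close together" idea concrete, here is a minimal sketch in Python. The node names and coordinates are made up for illustration (they are not learned from any data and don't come from either paper), and Euclidean distance is just one reasonable choice of similarity measure.

```python
import math

# Hypothetical 2-D embeddings for a handful of biomedical objects.
# The coordinates are invented for illustration, not learned from data.
embeddings = {
    "drug_A": (0.9, 0.1),
    "drug_B": (0.8, 0.2),
    "protein_X": (0.1, 0.9),
    "protein_Y": (0.2, 0.8),
}


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_neighbor(node, embeddings):
    """Return the name of the other node whose embedding is closest to `node`."""
    others = (name for name in embeddings if name != node)
    return min(others, key=lambda name: distance(embeddings[node], embeddings[name]))


print(nearest_neighbor("drug_A", embeddings))
print(nearest_neighbor("protein_X", embeddings))
```

In a real embedding the points would come out of a training procedure and have many more dimensions, but the nearest-neighbor lookup used for clustering or classification works the same way.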
Kang et al. describe conditional network embeddings as refined network embeddings that take into account prior knowledge about the structural properties of a network. Buyl et al. describe some common types of prior knowledge:
Buyl et al. go a step further and adapt the conditional network embedding by weighting data that might lead to discrimination more heavily. This means the embedding takes the underlying bias in the data into account regardless of whether the prior already reflects it. Adding this probabilistic view to the model also stops the person building it from taking the data at face value.
It shouldn't surprise you that many of these biases existed long before everyone thought that using algorithms to decide where British children would be placed for further studies was a good idea. What is surprising is the blind trust in these new algorithmic tools as paragons of fairness, given that they're trained on data that our biased systems have produced. Since pretty much all data should be assumed to carry one bias or another, it seems critical for data analysts and data scientists to know how to include bias correction in their workflow.
One very low-hanging fruit in bias correction is to foster collaboration with subject matter experts from early on in the process. A sociologist will (hopefully) point to a host of socio-economic factors that explain why a pupil is underperforming according to algorithm A. This should lead the designers to iteratively improve the algorithm's predictive ability so that it takes these factors into account.
Over the next couple of posts, I hope to run a journal club exercise (of one) on a review article by Maarten Buyl and Tijl De Bie, published on arXiv on 6th March 2020, about a method for building a less biased predictor. In the article, the authors provide an overview of more technical strategies for bias correction (with a focus on the second one) including:
Most work on sentiment analysis has been done on short-form text, such as posts from micro-blogging social media sites. Because these posts are so succinct, it is perhaps easier to glean a point of view from them than from longer pieces of text.
Spurred by my penchant for learning new languages, I decided to perform sentiment analysis on long-form text in the form of book reviews. The aim was to see whether different models are as capable when the user leaves a more verbose review. Book cataloging websites might benefit from the insights gained from this project to improve their recommendation systems. If longer reviews do result in less reliable recommendations, a website might decide to implement a word limit on reviews.
- EDA and Preprocessing
- Milestone Report
- Bernoulli Naive Bayes Model
- Long Short Term Memory (LSTM) Neural Network
- Simple Recurrent Neural Network (simple RNN)
- TL;DR Slides
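To give a feel for the first model in the list above, here is a minimal from-scratch sketch of Bernoulli Naive Bayes in Python. It is not the code from the project notebooks: the four toy "reviews", the function names and the smoothing constant `alpha` are all assumptions made for illustration. The key idea it shows is that each word is treated as present or absent in a review, and absent words count as evidence too.

```python
import math
from collections import defaultdict


def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Fit Bernoulli Naive Bayes on tokenized documents.

    docs   : list of token lists
    labels : list of class labels, one per document
    alpha  : Laplace smoothing constant (illustrative default)
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    classes = sorted(set(labels))
    n_docs = {c: 0 for c in classes}
    doc_freq = {c: defaultdict(int) for c in classes}
    for doc, label in zip(docs, labels):
        n_docs[label] += 1
        for tok in set(doc):  # presence/absence only, word counts are ignored
            doc_freq[label][tok] += 1
    log_prior = {c: math.log(n_docs[c] / len(docs)) for c in classes}
    # P(word present | class), smoothed so it is never exactly 0 or 1
    p_word = {
        c: {w: (doc_freq[c][w] + alpha) / (n_docs[c] + 2 * alpha) for w in vocab}
        for c in classes
    }
    return vocab, classes, log_prior, p_word


def predict(model, doc):
    """Return the class with the highest log-posterior for a tokenized document."""
    vocab, classes, log_prior, p_word = model
    present = set(doc)
    scores = {}
    for c in classes:
        score = log_prior[c]
        for w in vocab:
            p = p_word[c][w]
            # Words that are absent contribute log(1 - p): the "Bernoulli" part.
            score += math.log(p) if w in present else math.log(1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy training data, invented for illustration only.
docs = [
    ["loved", "this", "book"],
    ["wonderful", "story"],
    ["boring", "plot"],
    ["hated", "this", "book"],
]
labels = ["pos", "pos", "neg", "neg"]

model = train_bernoulli_nb(docs, labels)
print(predict(model, ["loved", "the", "story"]))
print(predict(model, ["boring", "book"]))
```

Words that never appeared in training (like "the" above) are simply ignored at prediction time, which is the usual behavior for a fixed vocabulary.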
The LSTM neural network had the best performance on both accuracy and AUC score, followed by the Bernoulli Naive Bayes model and then the simple RNN.
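For readers less familiar with the two metrics, here is a minimal sketch of how accuracy and AUC can be computed by hand in Python. The labels and scores below are invented for illustration and are not results from this project. The AUC function uses the pairwise definition (the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counting half), which is fine for small examples but slow on large datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = positive review) and model scores, for illustration only.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.8, 0.2]

# Accuracy needs hard predictions, so threshold the scores at 0.5.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print(accuracy(y_true, y_pred))
print(roc_auc(y_true, scores))
```

Accuracy depends on the chosen threshold, while AUC summarizes how well the scores rank positives above negatives across all thresholds, which is why it is useful to report both.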
Removing non-English text from my dataset left me with a smaller training corpus than I intended, so it's impossible to give a definitive answer to my original hypothesis. Despite the commendable performance of the LSTM, the simple RNN could perhaps benefit from a larger training corpus to improve its performance.