**Task: Identifying sentences susceptible to machine translation bias**

**Team: Aleksandra Konovalova, Lukas Felser, Laura König**

The idea was to create a solution that could potentially be language-independent. We tested several approaches:
- Logistic regression (sklearn)
- NER and dependency trees (spaCy)
- Pretrained word embeddings (gensim, just a quick look)

In [3]:
import numpy as np
import pandas as pd 

import string

In [4]:
# on_bad_lines='skip' needs pandas >= 1.3; older versions used error_bad_lines=False
df = pd.read_csv('toydata.tsv', sep='\t', header=None, names=['label', 'source', 'en', 'de', 'es'], on_bad_lines='skip')

Some lines were bad (more columns than expected), so we skipped them.
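
Before skipping rows silently, it can help to know how many there are and where they sit. The sketch below uses only the standard library and flags rows with more than five tab-separated fields. The helper name `split_rows` and the two sample rows are made up for illustration; they are not taken from `toydata.tsv`.

```python
import csv
import io

EXPECTED_COLUMNS = 5  # label, source, en, de, es


def split_rows(handle, expected=EXPECTED_COLUMNS):
    """Separate usable rows from rows with more fields than expected."""
    good, bad = [], []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        target = bad if len(row) > expected else good
        target.append((line_number, row))
    return good, bad


# Hypothetical sample: the second row carries one extra field.
sample = io.StringIO(
    "1\tWikiBio\tShe is a chemist.\tSie ist Chemikerin.\tElla es química.\n"
    "0\tWMT13\tPrices rose.\tDie Preise stiegen.\tLos precios subieron.\textra\n"
)
good, bad = split_rows(sample)
print(len(good), "usable rows;", len(bad), "malformed rows at lines", [n for n, _ in bad])
```

To run the same check on the real file, pass a handle from `open('toydata.tsv', newline='', encoding='utf-8')` instead of the `StringIO` object (the encoding is an assumption).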

In [5]:
df

Unnamed: 0,label,source,en,de,es
0,1,WikiBio,So Tara Devi received informal tuition at home.,So Tara Devi erhielt zu Hause eine informelle ...,"Debido a esto, recibía clases informales en su..."
1,1,WikiBio,Chandrima Shaha (born 14 October 1952) is an ...,Chandrima Shaha (geboren am 14. Oktober 1952) ...,Chandrima Shaha (nacida el 14 de octubre de 19...
2,0,WMT13,"Moreover, it should be noted that the country ...","Außerdem wird darauf hingewiesen, dass sich da...","Más aún, se señala que el país se plegó a los ..."
3,1,WikiBio,"In 2019, Forbes ranked her the 8th most powerf...",2019 setzte Forbes sie auf Platz 8 der mächtig...,"En 2019, Forbes la reconoció como la octava mu..."
4,1,WikiBio,He has also served as executive director of th...,Er fungierte auch als Executive Director des C...,También brindó servicios como director ejecuti...
...,...,...,...,...,...
2915,1,WikiBio,"Upon winning a seat, she was dismissed from he...",Nach dem Gewinn des Sitzes wurde sie von ihrer...,"Después de ocupar el escaño, dejó su cargo en ..."
2916,1,WikiBio,Ruth-Rolland's subsequent detainment at a poli...,Gegen die anschließende Inhaftierung von Ruth-...,Amnistía Internacional se opuso a la detención...
2917,0,WMT13,Baku became famous in May due to the Eurovisio...,Baku wurde im Mai mit dem Eurovision-Song-Fest...,Baku se hizo famosa en mayo con motivo de la c...
2918,0,WMT13,"Furthermore, these laws also reduce early voti...",Darüber hinaus werden durch diese Gesetze eben...,"Por otra parte, esas leyes reducen los período..."


In [6]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

In [7]:
corpus_en = df['en'].to_numpy()
corpus_de = df['de'].to_numpy()
corpus_es = df['es'].to_numpy()

In [8]:
vectorizer_en = TfidfVectorizer()

In [10]:
X_en = vectorizer_en.fit_transform(corpus_en)
y = df['label'].to_numpy()

In [11]:
X_train_en, X_test_en, y_train, y_test = train_test_split(X_en, y, test_size=0.33, random_state=42)

In [12]:
from sklearn.linear_model import LogisticRegression

In [13]:
regr_en = LogisticRegression(random_state=0, solver='liblinear', multi_class='ovr')

In [14]:
regr_en.fit(X_train_en,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='ovr', n_jobs=None, penalty='l2', random_state=0,
                   solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [23]:
y_pred_en = np.round(regr_en.predict(X_test_en))

In [28]:
wrong_predictions = (y_test != y_pred_en).nonzero()

In [29]:
len(wrong_predictions[0])

113

Let's look at the false negatives: sentences across the whole corpus (training and test rows alike) that are labelled as mentioning a person or people but that the model predicted as not mentioning any.

In [30]:
# Collect false negatives over the whole corpus: gold label 1, predicted 0.
predictions_en = regr_en.predict(vectorizer_en.transform(corpus_en))
bad_predictions = [sentence for sentence, gold, pred in zip(corpus_en, y, predictions_en)
                   if gold == 1 and pred == 0]

bad_predictions

['This patient said to me after I replaced his second hip using the PATH® Technique that it made a huge difference.',
 "Bjørn has also created five Hans Christian Andersen ballets for the Pantomime Theatre in Copenhagen's Tivoli.",
 'After this operation, the patients are completely free from pain in the hip joint but their hip joint is also completely stiff.',
 'This is the current Asian record.',
 'I began to experiment with a low-fat diet that I devised for myself.',
 'Luciana Aymar is the only player that has participated and won those four medals. ',
 "It's all right for us - and our children - to be bored on occasion, they say.",
 'Using such scales, researchers have discovered that boys tend to be bored more often than girls, said Stephen Vodanovich, a professor of psychology at the University of West Florida, especially when it comes needing more, and a variety of, external stimulation.',
 'Your doctor may ask you to use the results to adjust the amount of drugs you to take to 

Some of the sentences contain a person's name, so NER will probably help with them. Others mention jobs (doctor, salesmen, etc.), nationalities, or clearly animate entities (adults, women), so identifying sentences with such animate entities should help. I think that sentences with first-person pronouns can also be identified automatically (the situation with other pronouns may differ depending on the language).
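
The first-person pronoun idea can be sketched without any NLP library. This is a minimal English-only heuristic, not something the notebook ran: the pronoun list and the helper name `has_first_person` are assumptions, and the example sentences are taken from the misclassified list above.

```python
import re

# Hand-picked English pronoun list (an assumption, not part of the notebook).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}

WORD = re.compile(r"[a-z]+")


def has_first_person(sentence):
    """Return True if the sentence contains an English first-person pronoun."""
    tokens = WORD.findall(sentence.lower())
    return any(token in FIRST_PERSON for token in tokens)


examples = [
    "I began to experiment with a low-fat diet that I devised for myself.",
    "This is the current Asian record.",
    "It's all right for us - and our children - to be bored on occasion, they say.",
]
for sentence in examples:
    print(has_first_person(sentence), "|", sentence)
```

Lowercasing keeps the check simple, but it also matches the abbreviation "US", so a real feature would need a case-aware rule or a tagger.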

In [21]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

Let's look at the metrics for English first, and then at the other available languages.

In [22]:
accuracies = dict()
precisions = dict()
recalls = dict()

In [31]:
acc_en = accuracy_score(y_test, y_pred_en)
accuracies['acc_en'] = acc_en
acc_en

0.8827800829875518

In [32]:
prec_en = precision_score(y_test, y_pred_en)
precisions['prec_en'] = prec_en
prec_en

0.9146341463414634

In [33]:
rec_en = recall_score(y_test, y_pred_en)
recalls['rec_en'] = rec_en
rec_en

0.8637236084452975
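
The three English scores and the 113 wrong predictions reported earlier fit one set of confusion-matrix counts. The counts below are inferred by working backwards from the reported values; the notebook itself never prints a confusion matrix, so treat them as a consistency check rather than an original result.

```python
# Counts inferred from the reported scores (not printed in the notebook).
tp, fp, fn, tn = 450, 42, 71, 401

total = tp + fp + fn + tn            # size of the test split
wrong = fp + fn                      # misclassified test sentences

accuracy = (tp + tn) / total         # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted positives that are correct
recall = tp / (tp + fn)              # share of true positives that were found

print(total, wrong)                  # 964 113
print(round(accuracy, 4), round(precision, 4), round(recall, 4))  # 0.8828 0.9146 0.8637
```

Read this way, the English model misses 71 person sentences (false negatives) and flags 42 sentences that contain no person (false positives).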

In [37]:
vectorizer_de = TfidfVectorizer()
X_de = vectorizer_de.fit_transform(corpus_de)
X_train_de, X_test_de, y_train, y_test = train_test_split(X_de, y, test_size=0.33, random_state=42)
regr_de = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr')
regr_de.fit(X_train_de,y_train)
y_pred_de = np.round(regr_de.predict(X_test_de))

In [38]:
acc_de = accuracy_score(y_test, y_pred_de)
accuracies['acc_de'] = acc_de
acc_de

0.8578838174273858

In [39]:
prec_de = precision_score(y_test, y_pred_de)
precisions['prec_de'] = prec_de
prec_de

0.8902439024390244

In [40]:
rec_de = recall_score(y_test, y_pred_de)
recalls['rec_de'] = rec_de
rec_de

0.8406909788867563

In [41]:
vectorizer_es = TfidfVectorizer()
X_es = vectorizer_es.fit_transform(corpus_es.astype('U'))
X_train_es, X_test_es, y_train_es, y_test_es = train_test_split(X_es, y, test_size=0.33, random_state=42)
regr_es = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr')
regr_es.fit(X_train_es,y_train_es)
y_pred_es = np.round(regr_es.predict(X_test_es))

In [42]:
acc_es = accuracy_score(y_test_es, y_pred_es)
accuracies['acc_es'] = acc_es
acc_es

0.8298755186721992

In [43]:
prec_es = precision_score(y_test_es, y_pred_es)
precisions['prec_es'] = prec_es
prec_es

0.86652977412731

In [44]:
rec_es = recall_score(y_test_es, y_pred_es)
recalls['rec_es'] = rec_es
rec_es

0.8099808061420346

In [45]:
accuracies

{'acc_de': 0.8578838174273858,
 'acc_en': 0.8827800829875518,
 'acc_es': 0.8298755186721992}

In [46]:
precisions

{'prec_de': 0.8902439024390244,
 'prec_en': 0.9146341463414634,
 'prec_es': 0.86652977412731}

In [47]:
recalls

{'rec_de': 0.8406909788867563,
 'rec_en': 0.8637236084452975,
 'rec_es': 0.8099808061420346}

I also decided to look at pretrained Gensim embeddings to see whether words that refer to animate entities are similar to one another.
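
Gensim's `most_similar` ranks neighbours by cosine similarity between word vectors. The sketch below shows that computation in plain Python. The three-dimensional vectors are invented for illustration (the real `glove-twitter-25` vectors have 25 dimensions), and the helper `nearest` is a hypothetical stand-in, not the Gensim implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Invented toy vectors, for illustration only.
toy_vectors = {
    "woman": [0.9, 0.8, 0.1],
    "mother": [0.8, 0.9, 0.2],
    "table": [0.1, 0.2, 0.9],
}


def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


print(nearest("woman", toy_vectors))
```

Because cosine similarity ignores vector length, frequent function words can still rank highly if they point in a similar direction, which is worth keeping in mind when reading the neighbour lists.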

In [49]:
import gensim.downloader as api

In [50]:
model_glove_twitter = api.load("glove-twitter-25")



In [51]:
print(model_glove_twitter)

<gensim.models.keyedvectors.Word2VecKeyedVectors object at 0x7f52b6c898d0>


In [52]:
model_glove_twitter.most_similar('people', topn=20)

[('other', 0.9613626003265381),
 ('those', 0.9485428929328918),
 ('ones', 0.9484484195709229),
 ('especially', 0.9475629329681396),
 ('reason', 0.9441413283348083),
 ('lot', 0.9433570504188538),
 ('things', 0.9403640031814575),
 ('except', 0.9374849796295166),
 ('they', 0.9373652935028076),
 ('there', 0.935519814491272),
 ('because', 0.9352456331253052),
 ('about', 0.9349503517150879),
 ("'re", 0.9346608519554138),
 ('friends', 0.9337660074234009),
 ('many', 0.9336183667182922),
 ('understand', 0.9331066608428955),
 ('either', 0.933001697063446),
 ('how', 0.93278968334198),
 ('unless', 0.9312679767608643),
 ('them', 0.9289588332176208)]

In [53]:
model_glove_twitter.most_similar('man', topn=20)

[('was', 0.9065622091293335),
 ('i', 0.8880172371864319),
 ('he', 0.8874381184577942),
 ('bad', 0.8846144080162048),
 ('even', 0.8832389116287231),
 ('be', 0.8784030079841614),
 ('we', 0.8764979839324951),
 ('not', 0.8764553666114807),
 ('had', 0.8762109279632568),
 ('glad', 0.8758710622787476),
 ('is', 0.8737497925758362),
 ('am', 0.8718030452728271),
 ('so', 0.8705556988716125),
 ('men', 0.8705453872680664),
 ('wo', 0.8682353496551514),
 ('rest', 0.8665797710418701),
 ('sad', 0.8657856583595276),
 ('fucking', 0.8622854351997375),
 ('over', 0.8621624112129211),
 ('an', 0.8621268272399902)]

In [59]:
model_glove_twitter.most_similar('woman', topn=20)

[('child', 0.9371739625930786),
 ('mother', 0.9214695692062378),
 ('whose', 0.9174973964691162),
 ('called', 0.9146500825881958),
 ('person', 0.9135538339614868),
 ('wife', 0.9088311195373535),
 ('being', 0.9037442803382874),
 ('father', 0.9028053283691406),
 ('guy', 0.9026351571083069),
 ('known', 0.8997253179550171),
 ('who', 0.893405556678772),
 ('women', 0.8889436721801758),
 ('born', 0.8853096961975098),
 ('become', 0.8848217725753784),
 ('children', 0.8835378289222717),
 ('virgin', 0.8829305171966553),
 ('husband', 0.8827938437461853),
 ('human', 0.87855464220047),
 ('female', 0.8775132298469543),
 ('rich', 0.8770859837532043)]

In [58]:
model_glove_twitter.most_similar('child', topn=20)

[('mother', 0.9377204775810242),
 ('woman', 0.9371739625930786),
 ('father', 0.9260582327842712),
 ('children', 0.9258805513381958),
 ('called', 0.9033197164535522),
 ('death', 0.8979307413101196),
 ('wife', 0.896138072013855),
 ('birth', 0.8957393169403076),
 ('whose', 0.8928542733192444),
 ('daughter', 0.8889879584312439),
 ('slave', 0.8850050568580627),
 ('human', 0.8845320343971252),
 ('self', 0.8805544376373291),
 ('husband', 0.8750180006027222),
 ('every', 0.8702450394630432),
 ('known', 0.8701492547988892),
 ('person', 0.8693212270736694),
 ('kids', 0.8676499128341675),
 ('age', 0.8642564415931702),
 ('born', 0.8618560433387756)]

In [54]:
model_glove_twitter.most_similar('doctor', topn=20)

[('doc', 0.8603223562240601),
 ('has', 0.8328831195831299),
 ('found', 0.8258830904960632),
 ('actual', 0.8237391114234924),
 ('dad', 0.8182487487792969),
 ('ted', 0.8157984018325806),
 ('ron', 0.8132396340370178),
 ('teacher', 0.8123688101768494),
 ('idea', 0.8085229992866516),
 ('brother', 0.8054851293563843),
 ('spanish', 0.8051372170448303),
 ('bell', 0.8041635751724243),
 ('called', 0.8024425506591797),
 ('mary', 0.8015022277832031),
 ('mr.', 0.8011877536773682),
 ('father', 0.7991638779640198),
 ('comes', 0.7976055145263672),
 ('george', 0.7950779795646667),
 ('john', 0.7936967611312866),
 ('director', 0.7921426892280579)]

In [57]:
model_glove_twitter.most_similar('student', topn=20)

[('students', 0.9311226010322571),
 ('senior', 0.9288715720176697),
 ('group', 0.9277381300926208),
 ('staff', 0.9138274192810059),
 ('primary', 0.911514937877655),
 ('private', 0.9070918560028076),
 ('college', 0.897129237651825),
 ('office', 0.8945457935333252),
 ('law', 0.8890551328659058),
 ('department', 0.8884310126304626),
 ('job', 0.8838340640068054),
 ('schools', 0.8792070746421814),
 ('teachers', 0.8772590160369873),
 ('research', 0.8759735822677612),
 ('business', 0.875503420829773),
 ('education', 0.8738331198692322),
 ('graduate', 0.8713834881782532),
 ('board', 0.8706340193748474),
 ('program', 0.8698241710662842),
 ('faculty', 0.8693956136703491)]

Some words (e.g. *doctor* or *child*) have other animate entities among their nearest neighbours, but *people* and *man* mostly do not (compare with *woman*). For *man*, one possible reason is that the word is also used as an interjection, so it appears in broader contexts than *woman*. The bias regarding women is nevertheless visible even at this point: the neighbours of *woman* are dominated by family roles (*mother*, *wife*, *husband*) and include words such as *virgin*.
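
One rough way to put a number on this observation is to count how many of the top-10 neighbours are nouns that denote animate entities. The neighbour lists below are copied from the outputs above. The `ANIMATE_NOUNS` set is hand-picked for this sketch and is an assumption; a different set would give different counts.

```python
# Top-10 neighbours copied from the most_similar outputs above.
top10 = {
    "people": ["other", "those", "ones", "especially", "reason",
               "lot", "things", "except", "they", "there"],
    "man": ["was", "i", "he", "bad", "even",
            "be", "we", "not", "had", "glad"],
    "woman": ["child", "mother", "whose", "called", "person",
              "wife", "being", "father", "guy", "known"],
}

# Hand-picked nouns denoting animate entities (an assumption of this sketch).
ANIMATE_NOUNS = {"child", "children", "mother", "father", "wife", "husband",
                 "person", "guy", "men", "women", "friends", "human"}

animate_counts = {
    word: sum(neighbour in ANIMATE_NOUNS for neighbour in neighbours)
    for word, neighbours in top10.items()
}
print(animate_counts)  # {'people': 0, 'man': 0, 'woman': 6}
```

Pronouns such as *i*, *he*, and *we* are deliberately left out of the set, so the zero for *man* reflects missing animate nouns, not missing references to people.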