# Retrieving Verb Information in Nouns, Adjectives & Verbs using Minimum Intra-class STD + Maximum Inter-class Average Difference
This notebook attempts to isolate the Word Embedding dimensions that encode **part-of-speech information** (verb / not verb) in a sample of FlauBERT Word Embeddings for **nouns**, **adjectives** and **verbs**.

## 0. Data Loading

In [1]:
import pandas as pd

# Load the word embeddings (WE); the 'verb' column is the class label (1 = verb, 0 = not verb)
nouns = pd.read_csv('../Data/FlauBERT_WE/all_nouns_we.csv', index_col=0).drop(columns=["number", "gender"])
nouns['verb'] = 0
adjs = pd.read_csv('../Data/FlauBERT_WE/all_adjectives_we.csv', index_col=0).drop(columns=["number", "gender"])
adjs['verb'] = 0
verbs = pd.read_csv('../Data/FlauBERT_WE/all_verb_we.csv', index_col=0)
verbs['verb'] = 1
# Stack the three samples into a single DataFrame
all_we = pd.concat([nouns, adjs, verbs])

# Min-max normalization: rescale every column to [0, 1] (the 0/1 'verb' label is left unchanged)
normalized_data = (all_we - all_we.min())/(all_we.max() - all_we.min())
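To make the effect of this min-max step concrete, here is a minimal sketch on a toy DataFrame. The `toy` and `scaled` names and all the values are made up for illustration; only the formula is taken from the cell above.

```python
import pandas as pd

# Toy stand-in for all_we: two embedding dimensions plus the 0/1 'verb' label
toy = pd.DataFrame({
    '0': [-2.0, 0.0, 2.0],
    '1': [10.0, 15.0, 30.0],
    'verb': [0, 0, 1],
})

# Same formula as in the notebook: each column is rescaled independently
scaled = (toy - toy.min()) / (toy.max() - toy.min())
print(scaled)
```

Each column is mapped to the range [0, 1] using its own minimum and maximum, so standard deviations and mean differences become comparable across dimensions. A column with a constant value would produce a division by zero, which this sketch does not handle.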

## 1. Lowest intra-class standard deviation

Separate the data into two classes: verb and not verb.

In [2]:
verb_norm = normalized_data[normalized_data["verb"] == 1]
not_verb_norm = normalized_data[normalized_data["verb"] == 0]

Lowest std for class **verb**:

In [3]:
verb_norm.loc[:,:'511'].std().sort_values()[:10]

69     0.096868
422    0.097616
408    0.097954
243    0.098164
157    0.098201
277    0.098976
314    0.099588
441    0.099668
293    0.100247
287    0.100441
dtype: float64

Lowest std for class **not_verb**:

In [4]:
not_verb_norm.loc[:,:'511'].std().sort_values()[:10]

157    0.099458
441    0.100394
287    0.102451
209    0.102641
277    0.102647
365    0.103298
69     0.103548
421    0.104138
60     0.104772
25     0.105097
dtype: float64
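One way to read the two rankings together is to look for dimensions whose spread is low in both classes. The following is a minimal sketch with the index labels copied from the two outputs above; the variable names are illustrative and not part of the original notebook.

```python
# Ten lowest-std dimensions per class, copied from the outputs of In [3] and In [4]
verb_low_std = ['69', '422', '408', '243', '157', '277', '314', '441', '293', '287']
not_verb_low_std = ['157', '441', '287', '209', '277', '365', '69', '421', '60', '25']

# Keep the order of the first ranking while testing membership in the second
other = set(not_verb_low_std)
shared = [dim for dim in verb_low_std if dim in other]
print(shared)  # ['69', '157', '277', '441', '287']
```

Five of the ten dimensions appear in both lists. A low spread in both classes does not by itself make a dimension discriminative; it only becomes informative when the class means also differ, which is what section 2 looks at.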

## 2. Dimensions with greatest difference between avg_verb and avg_not_verb

This seems encouraging: only a few dimensions show the highest differences in each of the datasets considered. We also see higher differences for adjectives.

In [5]:
# Top 10 dimensions with the largest absolute difference between the two class means
abs(verb_norm.loc[:, :'511'].mean() - not_verb_norm.loc[:, :'511'].mean()).sort_values(ascending=False)[:10]

310    0.073455
159    0.072571
480    0.067617
401    0.066168
89     0.061795
158    0.059015
282    0.058057
192    0.057955
504    0.057734
458    0.057161
dtype: float64

In [7]:
abs(verb_norm.loc[:, :'511'].mean() - not_verb_norm.loc[:, :'511'].mean()).sort_values(ascending=False)[:10].index

Index(['310', '159', '480', '401', '89', '158', '282', '192', '504', '458'], dtype='object')
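None of these ten dimensions appears among the lowest-std dimensions of section 1, so the two criteria have to be weighed against each other. The notebook does not say how to combine them; the sketch below assumes one possible choice, the absolute mean difference divided by the sum of the two intra-class standard deviations. The `rank_dimensions` helper and the synthetic data (with a shift planted in dimension `'3'`) are hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

def rank_dimensions(class_a, class_b, top=10):
    """Rank columns by |mean difference| / (std_a + std_b), highest first.

    This scoring rule is an assumption, not the notebook's own method.
    """
    mean_gap = (class_a.mean() - class_b.mean()).abs()
    spread = class_a.std() + class_b.std()
    return (mean_gap / spread).sort_values(ascending=False).head(top)

# Synthetic data: 8 dimensions, 200 rows per class, same distribution everywhere
rng = np.random.default_rng(0)
cols = [str(i) for i in range(8)]
class_a = pd.DataFrame(rng.normal(0.5, 0.1, size=(200, 8)), columns=cols)
class_b = pd.DataFrame(rng.normal(0.5, 0.1, size=(200, 8)), columns=cols)

# Plant a class difference in a single dimension
class_b['3'] += 0.4

ranking = rank_dimensions(class_a, class_b, top=3)
print(ranking)
```

On the notebook's data, the equivalent call would be `rank_dimensions(verb_norm.loc[:, :'511'], not_verb_norm.loc[:, :'511'])`. A dimension scores high only when its class means are far apart relative to the spread inside each class.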