# Retrieving Adjective (Part-of-Speech) Information using Minimum Intra-class STD + Maximum Extra-class Average Difference
In this notebook, we attempt to isolate the Word Embedding dimensions that encode **part-of-speech information** (adjective vs. non-adjective) in a sample of FlauBERT Word Embeddings for **NOUNS**, **ADJECTIVES** and **VERBS**.

## 0. Data Loading

In [1]:
import pandas as pd

# Load the word embeddings (WE) and flag each row: adj = 1 for adjectives, 0 for nouns and verbs
nouns = pd.read_csv('../Data/FlauBERT_WE/all_nouns_we.csv', index_col=0).drop(columns=["number", "gender"])
nouns['adj'] = 0
adjs = pd.read_csv('../Data/FlauBERT_WE/all_adjectives_we.csv', index_col=0).drop(columns = ["number", "gender"])
adjs['adj'] = 1
verbs = pd.read_csv('../Data/FlauBERT_WE/all_verb_we.csv', index_col=0)
verbs['adj'] = 0
all_we = pd.concat([nouns, adjs, verbs])

# Column-wise min-max normalization: every dimension is rescaled to [0, 1]
normalized_data = (all_we - all_we.min())/(all_we.max() - all_we.min())
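As a minimal sketch of what the normalization step does, the toy example below applies the same column-wise min-max formula to a small, made-up DataFrame. The column names and values are illustrative assumptions, not the FlauBERT data. Note that a constant column would give a zero denominator and therefore NaN values.

```python
import pandas as pd

# Toy stand-in for the embedding table (values are made up for illustration).
toy = pd.DataFrame({"0": [1.0, 3.0, 5.0], "1": [10.0, 20.0, 40.0], "adj": [0, 1, 0]})

# Column-wise min-max scaling: each column is mapped to [0, 1].
# The binary 'adj' flag is left unchanged because its min is 0 and its max is 1.
toy_norm = (toy - toy.min()) / (toy.max() - toy.min())
print(toy_norm)
```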

## 1. Lowest intra-class standard deviation

Separate the data into two classes: adjective and not adjective.

In [2]:
adj_norm = normalized_data[normalized_data["adj"] == 1]
not_adj_norm = normalized_data[normalized_data["adj"] == 0]

Lowest std for class **adjective**:

In [3]:
adj_norm.loc[:,:'511'].std().sort_values()[:10]

157    0.096814
69     0.100743
277    0.100852
441    0.100972
422    0.102595
314    0.102598
26     0.103030
209    0.103123
287    0.103896
25     0.104124
dtype: float64

Lowest std for class **not_adjective**:

In [4]:
not_adj_norm.loc[:,:'511'].std().sort_values()[:10]

157    0.099693
441    0.100581
287    0.101189
69     0.101614
277    0.101949
408    0.102070
422    0.102541
314    0.102918
209    0.103055
25     0.103689
dtype: float64
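Comparing the two top-10 lists by eye is error-prone. As a small sketch, the snippet below hard-codes the dimension indices from the two outputs above and computes their overlap. With these values, nine of the ten dimensions are shared, which suggests that a low intra-class standard deviation alone may not separate the two classes.

```python
# Dimension indices copied from the two top-10 outputs above.
adj_top = [157, 69, 277, 441, 422, 314, 26, 209, 287, 25]
not_adj_top = [157, 441, 287, 69, 277, 408, 422, 314, 209, 25]

# Dimensions that appear in both lists, and those specific to one class.
shared = sorted(set(adj_top) & set(not_adj_top))
only_adj = sorted(set(adj_top) - set(not_adj_top))
only_not_adj = sorted(set(not_adj_top) - set(adj_top))

print(len(shared), only_adj, only_not_adj)  # 9 [26] [408]
```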

## 2. Dimensions with greatest difference between avg_adj and avg_not_adj

This seems encouraging, as only a few dimensions show the highest differences between the two class means. We also see higher differences in adjectives.

In [5]:
# Top 10 dimensions with the highest absolute difference between the two class means
abs(adj_norm.loc[:, :'511'].mean() - not_adj_norm.loc[:, :'511'].mean()).sort_values(ascending=False)[:10]

276    0.054912
2      0.053821
478    0.049883
158    0.047296
370    0.045533
24     0.043677
220    0.043577
409    0.043102
464    0.040996
139    0.039065
dtype: float64
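The ranking above can be wrapped in a small reusable function. The sketch below is one possible way to do so; the function name `top_mean_diff` and the toy data are hypothetical and only serve to illustrate the computation.

```python
import pandas as pd

def top_mean_diff(class_a: pd.DataFrame, class_b: pd.DataFrame, k: int = 10) -> pd.Series:
    """Rank columns by the absolute difference between the two class means."""
    diff = (class_a.mean() - class_b.mean()).abs()
    return diff.sort_values(ascending=False).head(k)

# Toy data (made up): dimension "0" separates the classes, "1" and "2" barely do.
toy_adj = pd.DataFrame({"0": [0.9, 0.8], "1": [0.5, 0.5], "2": [0.2, 0.4]})
toy_not_adj = pd.DataFrame({"0": [0.1, 0.2], "1": [0.5, 0.6], "2": [0.3, 0.3]})

ranking = top_mean_diff(toy_adj, toy_not_adj, k=2)
print(ranking)
```

On the real data, the equivalent call would be `top_mean_diff(adj_norm.loc[:, :'511'], not_adj_norm.loc[:, :'511'])`.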

In [6]:
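# Load the dimensions already stored in adj.csv
w1 = list(pd.read_csv('../Data/Dimensions/PoS/adj.csv', index_col=0).iloc[:, 0].values)

# Append the 10 dimensions with the largest difference between the two class means
w1.extend(abs(adj_norm.loc[:, :'511'].mean() - not_adj_norm.loc[:, :'511'].mean()).sort_values(ascending=False)[:10].index)

# Write the extended list back to the same file
pd.DataFrame(w1).to_csv('../Data/Dimensions/PoS/adj.csv')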
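Because the cell above reads the file, extends the list and writes it back, running it more than once appends the same ten dimensions again. A minimal sketch of a duplicate-free merge is given below; the helper `merge_dimensions` is hypothetical, and it assumes the file only contains integer-like dimension indices.

```python
def merge_dimensions(existing, new):
    """Append new dimension labels to existing ones, skipping duplicates and keeping order."""
    merged = []
    seen = set()
    for dim in list(existing) + list(new):
        # Values read back from CSV may be ints while column labels are strings,
        # so everything is cast to int before comparing (assumption: integer-like labels).
        dim = int(dim)
        if dim not in seen:
            seen.add(dim)
            merged.append(dim)
    return merged

print(merge_dimensions([276, 2], ["276", "478"]))  # [276, 2, 478]
```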